Optimization of Parameter Values for Machine-Learned Models

ABSTRACT

The present disclosure provides computing systems and associated methods for optimizing one or more adjustable parameters (e.g., operating parameters) of a system. In particular, the present disclosure provides a parameter optimization system that can perform one or more black-box optimization techniques to iteratively suggest new sets of parameter values for evaluation. The iterative suggestion and evaluation process can serve to optimize or otherwise improve the overall performance of the system, as evaluated by an objective function that evaluates one or more metrics. The present disclosure also provides a novel black-box optimization technique known as “Gradientless Descent” that is more clever and faster than random search yet retains most of random search's favorable qualities.

FIELD

The present disclosure relates generally to black-box optimization. More particularly, the present disclosure relates to systems that perform black-box optimization (e.g., as a service) and to a novel black-box optimization technique.

BACKGROUND

A system can include a number of adjustable parameters that affect the quality, performance, and/or outcome of the system. Identifying parameter values that optimize the performance of the system (e.g., in general or for a particular application or user group) can be challenging, particularly when the system is complex (e.g., challenging to model) or includes a significant number of adjustable parameters.

In particular, any sufficiently complex system acts as a black-box when it becomes easier to experiment with than to understand. Hence, black-box optimization has become increasingly important as systems have become more complex.

Black-box optimization can include the task of optimizing an objective function ƒ: X → ℝ with a limited budget for evaluations. The adjective “black-box” means that while ƒ(x) can be evaluated for any x ∈ X, any other information about ƒ, such as gradients or the Hessian, is not generally known. When function evaluations are expensive, it is desirable to carefully and adaptively select values to evaluate. Thus, an overall goal of a black-box optimization technique can be to generate a sequence of x_t that approaches the global optimum as rapidly as possible.

SUMMARY

Aspects and advantages of embodiments of the present disclosure will be set forth in part in the following description, or can be learned from the description, or can be learned through practice of the embodiments.

One aspect of the present disclosure is directed to a computer-implemented method for use in optimization of parameters of a system, product, or process. The method includes establishing, by one or more computing devices, an optimization procedure for a system, product, or process. The system, product, or process has an evaluable performance that is dependent on values of one or more adjustable parameters. The method includes receiving, by the one or more computing devices, one or more prior evaluations of performance of the system, product, or process. The one or more prior evaluations are respectively associated with one or more prior variants of the system, product, or process. The one or more prior variants are each defined by a set of values for the one or more adjustable parameters. The method includes utilizing, by the one or more computing devices, an optimization algorithm to generate a suggested variant based at least in part on the one or more prior evaluations of performance and the associated set of values. The suggested variant is defined by a suggested set of values for the one or more adjustable parameters. The method includes receiving, by the one or more computing devices, one or more intermediate evaluations of performance of the suggested variant. The intermediate evaluations have been obtained from an ongoing evaluation of the suggested variant. The method includes performing, by the one or more computing devices, non-parametric regression, based on the intermediate evaluations and the prior evaluations, to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant. The method includes, in response to determining that early-stopping is to be performed, causing, by the one or more computing devices, early-stopping to be performed in respect of the ongoing evaluation or providing an indication that early-stopping should be performed.

Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant can include determining, by the one or more computing devices based on the non-parametric regression, a probability of a final performance of the suggested variant exceeding a current best performance as indicated by one of the prior evaluations of performance of a prior variant. Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant can include determining, by the one or more computing devices, whether to perform early-stopping of the ongoing evaluation based on a comparison of the determined probability with a threshold.
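The probability-and-threshold decision described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the kernel-weighted estimator, the `bandwidth` and `threshold` defaults, and the function names are all assumptions chosen for illustration, and higher performance values are assumed to be better.

```python
import math


def prob_exceeds_best(partial, prior_curves, bandwidth=0.1):
    """Estimate P(final performance of the ongoing trial > current best).

    partial: intermediate evaluations of the ongoing trial.
    prior_curves: completed performance curves of prior trials, each at
        least as long as `partial`.
    The estimator is a hypothetical kernel-weighted (non-parametric) one.
    """
    t = len(partial)
    best = max(curve[-1] for curve in prior_curves)
    total_weight = 0.0
    exceed_weight = 0.0
    for curve in prior_curves:
        # Weight each prior curve by how closely its prefix matches `partial`.
        dist2 = sum((a - b) ** 2 for a, b in zip(partial, curve[:t])) / t
        weight = math.exp(-dist2 / (2 * bandwidth ** 2))
        # Assume the ongoing trial gains what this prior trial gained after step t.
        predicted_final = partial[-1] + (curve[-1] - curve[t - 1])
        total_weight += weight
        if predicted_final > best:
            exceed_weight += weight
    return exceed_weight / total_weight if total_weight > 0 else 0.0


def should_stop_early(partial, prior_curves, threshold=0.05):
    """Stop when the estimated probability falls below the threshold."""
    return prob_exceeds_best(partial, prior_curves) < threshold
```

Under these assumptions, a trial whose partial curve lags every prior curve receives a low probability and is stopped, while a trial tracking ahead of the prior curves continues.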

Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant can include measuring, by the one or more computing devices, a similarity between a performance curve that is based on the intermediate evaluations and a performance curve corresponding to performance of a current best variant that is based on the prior evaluation for the current best variant.

The computer-implemented method can further include performing, by the one or more computing devices, transfer learning to obtain initial values for the one or more adjustable parameters. Performing, by the one or more computing devices, transfer learning can include identifying, by the one or more computing devices, a plurality of prior optimization procedures, the plurality of prior optimization procedures organized in a sequence. Performing, by the one or more computing devices, transfer learning can include building, by the one or more computing devices, a plurality of Gaussian Process regressors respectively for the plurality of prior optimization procedures. The Gaussian Process regressor for each prior optimization procedure is trained on one or more residuals relative to the Gaussian Process regressor for the previous prior optimization procedure in the sequence.
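The residual-stacking structure described above can be sketched as follows. To stay dependency-free, this sketch swaps in a simple one-dimensional kernel smoother where the disclosure uses a Gaussian Process regressor; the smoother, its `bandwidth`, and all function names are assumptions made only to show how each level is fit on the residuals left by the levels before it.

```python
import math


def fit_smoother(xs, ys, bandwidth=0.5):
    """Stand-in regressor (NOT a Gaussian Process): a 1-D kernel smoother."""
    def predict(x):
        weights = [math.exp(-((x - xi) ** 2) / (2 * bandwidth ** 2)) for xi in xs]
        total = sum(weights)
        if total == 0.0:
            return 0.0  # far from all data: contribute nothing
        return sum(w * y for w, y in zip(weights, ys)) / total
    return predict


def build_residual_stack(studies):
    """studies: list of (xs, ys) pairs, ordered from oldest to newest.

    Each level is trained on what the earlier levels failed to explain.
    """
    levels = []
    for xs, ys in studies:
        earlier = list(levels)  # regressors for the previous studies
        residuals = [y - sum(level(x) for level in earlier) for x, y in zip(xs, ys)]
        levels.append(fit_smoother(xs, residuals))
    # The stacked prediction is the sum of all per-study residual models.
    return lambda x: sum(level(x) for level in levels)
```

One plausible use, also an assumption, is to evaluate the stacked prediction over candidate parameter values and take the best-scoring candidate as the initial value for a new study.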

Another aspect of the present disclosure is directed to a computer system operable to suggest trial parameters. The computer system includes a database that stores one or more results respectively associated with one or more trials of a study. The one or more trials for the study respectively include one or more sets of values for one or more adjustable parameters associated with the study. The result for each trial includes an evaluation of the corresponding set of values for the one or more adjustable parameters. The computer system includes one or more processors and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations. The operations include performing one or more black-box optimization techniques to generate a suggested trial based at least in part on the one or more results and the one or more sets of values respectively associated with the one or more results. The suggested trial includes a suggested set of values for the one or more adjustable parameters. The operations include accepting an adjustment to the suggested trial from a user. The adjustment includes at least one change to the suggested set of values to form an adjusted set of values. The operations include receiving a new result obtained through evaluation of the adjusted set of values. The operations include associating the new result and the adjusted set of values with the study in the database.

The operations can further include generating a second suggested trial based at least in part on the new result for the adjusted set of values, the second suggested trial including a second suggested set of values for the one or more adjustable parameters.

The operations can further include performing a plurality of rounds of generation of suggested trials using at least two different black-box optimization techniques.

The operations can further include automatically and dynamically changing black-box optimization techniques between at least two of the plurality of rounds of generation of suggested trials.

The one or more black-box optimization techniques can be stateless so as to enable switching between black-box optimization techniques during the study.
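One way to picture statelessness is that each technique is a pure function of the study's stored history, so any technique can serve any round. The sketch below assumes that reading; the two policies, the `(params, result)` history format, and every name in it are hypothetical, and higher results are assumed to be better.

```python
import random


def random_search(history, bounds, rng):
    """Ignore history and sample each parameter uniformly from its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}


def local_perturbation(history, bounds, rng, scale=0.1):
    """Perturb the best trial found so far, reading it from history alone."""
    if not history:
        return random_search(history, bounds, rng)
    best_params, _ = max(history, key=lambda trial: trial[1])
    suggestion = {}
    for name, (lo, hi) in bounds.items():
        value = best_params[name] + rng.gauss(0.0, scale * (hi - lo))
        suggestion[name] = min(hi, max(lo, value))  # clip to the feasible range
    return suggestion


def run_study(objective, bounds, schedule, seed=0):
    """Run one technique per round; `schedule` may mix techniques freely."""
    rng = random.Random(seed)
    history = []  # the only state: (params, result) pairs kept with the study
    for policy in schedule:
        params = policy(history, bounds, rng)
        history.append((params, objective(params)))
    return history
```

Because neither policy keeps anything between calls, a schedule such as five rounds of `random_search` followed by five rounds of `local_perturbation` needs no hand-off step when the technique changes.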

The operations can further include performing a plurality of rounds of generation of suggested trials. The operations can further include receiving a change to a feasible set of values for at least one of the one or more adjustable parameters between at least two of the plurality of rounds of generation of suggested trials.

The operations can further include receiving a plurality of requests for additional suggested trials for the study. The operations can further include batching at least a portion of the plurality of requests together. The operations can further include generating, as a batch, the additional suggested trials in response to the plurality of requests.
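The batching of suggestion requests can be sketched as grouping pending requests by study, generating all suggestions for a study in one call, and then splitting the result among the requesters. The request tuple layout and the `suggest_batch` callback are assumptions for illustration only.

```python
from collections import defaultdict


def serve_batched(requests, suggest_batch):
    """Answer many suggestion requests with one generation call per study.

    requests: list of (request_id, study_id, count) tuples.
    suggest_batch: hypothetical callable (study_id, total) -> list of suggestions.
    """
    by_study = defaultdict(list)
    for request_id, study_id, count in requests:
        by_study[study_id].append((request_id, count))

    responses = {}
    for study_id, pending in by_study.items():
        total = sum(count for _, count in pending)
        suggestions = suggest_batch(study_id, total)  # generated as a batch
        cursor = 0
        for request_id, count in pending:
            # Hand each requester its own slice of the batch.
            responses[request_id] = suggestions[cursor:cursor + count]
            cursor += count
    return responses
```

Generating the batch in one call lets the underlying technique account for all pending suggestions together, which can avoid proposing near-duplicate trials to concurrent requesters.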

The operations can further include receiving intermediate statistics associated with an ongoing trial. The operations can further include performing non-parametric regression on the intermediate statistics to determine whether to perform early stopping of the ongoing trial.

The operations can further include performing transfer learning to obtain initial values for the one or more adjustable parameters. Performing transfer learning can include identifying a plurality of studies. The plurality of studies can be organized in a sequence. Performing transfer learning can include building a plurality of Gaussian Process regressors respectively for the plurality of studies. The Gaussian Process regressor for each study can be trained on one or more residuals relative to the Gaussian Process regressor for the previous study in the sequence.

The operations can further include providing for display a parallel coordinates visualization of the one or more results and the one or more sets of values for the one or more adjustable parameters.

Another aspect of the present disclosure is directed to a computer-implemented method to suggest trial parameters. The method includes establishing, by one or more computing devices, a study that includes one or more adjustable parameters. The method includes receiving, by the one or more computing devices, one or more results respectively associated with one or more trials of the study. The one or more trials respectively include one or more sets of values for the one or more adjustable parameters. The result for each trial includes an evaluation of the corresponding set of values for the one or more adjustable parameters. The method includes generating, by the one or more computing devices, a suggested trial based at least in part on the one or more results and the one or more sets of values. The suggested trial includes a suggested set of values for the one or more adjustable parameters. The method includes receiving, by the one or more computing devices, an adjustment to the suggested trial from a user. The adjustment includes at least one change to the suggested set of values to form an adjusted set of values. The method includes receiving, by the one or more computing devices, a new result associated with the adjusted set of values. The method includes associating, by the one or more computing devices, the new result and the adjusted set of values with the study.

The method can further include generating, by the one or more computing devices, a second suggested trial based at least in part on the new result for the adjusted set of values. The second suggested trial can include a second suggested set of values for the one or more adjustable parameters.

Generating, by the one or more computing devices, the suggested trial can include performing, by the one or more computing devices, a first black-box optimization technique to generate the suggested trial based at least in part on the one or more results and the one or more sets of values. Generating, by the one or more computing devices, the second suggested trial can include performing, by the one or more computing devices, a second black-box optimization technique to generate the second suggested trial based at least in part on the new result for the adjusted set of values. The second black-box optimization technique can be different from the first black-box optimization technique.

The method can further include, prior to performing, by the one or more computing devices, the second black-box optimization technique to generate the second suggested trial, receiving, by the one or more computing devices, a user input that selects the second black-box optimization technique from a plurality of available black-box optimization techniques.

The method can further include, prior to performing, by the one or more computing devices, the second black-box optimization technique to generate the second suggested trial, automatically selecting, by the one or more computing devices, the second black-box optimization technique from a plurality of available black-box optimization techniques.

Automatically selecting, by the one or more computing devices, the second black-box optimization technique from the plurality of available black-box optimization techniques can include automatically selecting, by the one or more computing devices, the second black-box optimization technique from the plurality of available black-box optimization techniques based at least in part on one or more of: a total number of trials associated with the study, a total number of adjustable parameters associated with the study, and a user-defined setting indicative of a desired processing time.
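The three selection inputs named above (trial count, parameter count, and desired processing time) could feed a rule such as the following sketch. The technique names, the thresholds, and the rule itself are illustrative assumptions; the disclosure does not specify them here.

```python
def choose_technique(num_trials, num_params, desired_seconds):
    """Pick a technique name for the next suggestion round.

    All thresholds and technique names are hypothetical placeholders.
    """
    if num_trials == 0:
        # Nothing to learn from yet, so sample the feasible set at random.
        return "random_search"
    # Assume a model-based technique is only worthwhile when the study is
    # small enough, and the time budget large enough, to fit a model.
    model_is_affordable = (
        num_trials <= 1000 and num_params <= 20 and desired_seconds >= 10.0
    )
    if model_is_affordable:
        return "model_based"
    return "gradientless_descent"
```

The design intent in this sketch is that a cheaper technique takes over as the study grows or as the user asks for faster suggestions.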

Generating, by the one or more computing devices, the suggested trial based at least in part on the one or more results and the one or more sets of values can include requesting, by the one or more computing devices via an internal abstract policy, generation of the suggested trial by an external custom policy provided by the user. Generating, by the one or more computing devices, the suggested trial based at least in part on the one or more results and the one or more sets of values can include receiving, by the one or more computing devices, the suggested trial from the external custom policy provided by the user.
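The request-and-receive pattern above resembles a proxy: the system calls an internal abstract interface, and one implementation forwards the call to user-supplied code. The class names and the callable-based transport below are assumptions for illustration; a deployed system would more likely reach the external policy over a network call.

```python
from abc import ABC, abstractmethod


class Policy(ABC):
    """Internal abstract policy: the only interface the system calls."""

    @abstractmethod
    def suggest(self, trials):
        """Return a suggested set of parameter values given past trials."""


class ExternalPolicyProxy(Policy):
    """Forwards suggestion requests to a user-provided custom policy."""

    def __init__(self, call_external):
        # `call_external` stands in for whatever transport reaches user code.
        self._call_external = call_external

    def suggest(self, trials):
        suggestion = self._call_external(list(trials))
        if not isinstance(suggestion, dict):
            raise TypeError("external policy must return a dict of parameter values")
        return suggestion
```

A user-side policy can then be as simple as `ExternalPolicyProxy(lambda trials: {"x": 0.5})`, and the rest of the system treats it like any built-in technique.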

Another aspect of the present disclosure is directed to a computer-implemented method for use in optimization of parameter values for machine-learning models. The method includes receiving, by one or more computing devices, one or more prior evaluations of performance of a machine-learning model. The one or more prior evaluations are respectively associated with one or more prior variants of the machine-learning model. The one or more prior variants of the machine-learning model each have been configured using a different set of adjustable parameter values. The method includes utilizing, by the one or more computing devices, an optimization algorithm to generate a suggested variant of the machine-learning model based at least in part on the one or more prior evaluations of performance and the associated set of adjustable parameter values. The suggested variant of the machine-learning model is defined by a suggested set of adjustable parameter values. The method includes receiving, by the one or more computing devices, one or more intermediate evaluations of performance of the suggested variant of the machine-learning model. The intermediate evaluations have been obtained from an ongoing evaluation of the suggested variant of the machine-learning model. The method includes performing, by the one or more computing devices, non-parametric regression, based on the intermediate evaluations and the prior evaluations, to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant of the machine-learning model. The method includes, in response to determining that early-stopping is to be performed, causing, by the one or more computing devices, early-stopping to be performed in respect of the ongoing evaluation of the suggested variant of the machine-learning model.

Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant of the machine-learning model can include determining, by the one or more computing devices based on the non-parametric regression, a probability of a final performance of the suggested variant of the machine-learning model exceeding a current best performance as indicated by one of the prior evaluations of performance of a prior variant of the machine-learning model. Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant of the machine-learning model can include determining, by the one or more computing devices, whether to perform early-stopping of the ongoing evaluation based on a comparison of the determined probability with a threshold.

Performing, by the one or more computing devices, non-parametric regression to determine whether to perform early-stopping of the ongoing evaluation of the suggested variant of the machine-learning model can include measuring, by the one or more computing devices, a similarity between a performance curve that is based on the intermediate evaluations and a performance curve corresponding to performance of a current best variant of the machine-learning model that is based on the prior evaluation for the current best variant of the machine-learning model.

The method can further include performing, by the one or more computing devices, transfer learning to obtain initial values for the one or more adjustable parameters of the machine-learning model. Performing, by the one or more computing devices, transfer learning can include identifying, by the one or more computing devices, a plurality of previously-optimized machine-learned models, the plurality of previously-optimized machine-learned models being organized in a sequence. Performing, by the one or more computing devices, transfer learning can include building, by the one or more computing devices, a plurality of Gaussian Process regressors respectively for the plurality of previously-optimized machine-learned models. The Gaussian Process regressor for each previously-optimized machine-learned model can be trained on one or more residuals relative to the Gaussian Process regressor for the previous previously-optimized machine-learned model in the sequence.

Another aspect of the present disclosure is directed to a computer system operable to suggest parameter values for machine-learned models. The computer system includes a database that stores one or more results respectively associated with one or more sets of parameter values for one or more adjustable parameters of a machine-learned model. The result for each set of parameter values includes an evaluation of the machine-learned model constructed with such set of parameter values for the one or more adjustable parameters. The computer system includes one or more processors and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations. The operations include performing one or more black box optimization techniques to generate a suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the one or more results and the one or more sets of parameter values respectively associated with the one or more results. The operations include accepting an adjustment to the suggested set of parameter values from a user. The adjustment includes at least one change to the suggested set of parameter values to form an adjusted set of parameter values. The operations include receiving a new result obtained through evaluation of the machine-learned model constructed with the adjusted set of parameter values. The operations include associating the new result and the adjusted set of parameter values with the one or more results and the one or more sets of parameter values in the database.

The operations can further include generating a second suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the new result for the adjusted set of parameter values.

The one or more adjustable parameters of the machine-learned model can include one or more adjustable hyperparameters of the machine-learned model.

The operations can further include performing a plurality of rounds of generation of suggested sets of parameter values using at least two different black box optimization techniques.

The operations can further include automatically changing black box optimization techniques between at least two of the plurality of rounds of generation of suggested sets of parameter values.

The at least two different black box optimization techniques can be stateless so as to enable switching between black box optimization techniques between at least two of the plurality of rounds of generation of suggested sets of parameter values.

The operations can further include performing a plurality of rounds of generation of suggested sets of parameter values. The operations can further include receiving a change to a feasible set of values for at least one of the one or more adjustable parameters of the machine-learned model between at least two of the plurality of rounds of generation of suggested sets of parameter values.

The operations can further include receiving intermediate statistics associated with an ongoing evaluation of an additional set of parameter values. The operations can further include performing non-parametric regression on the intermediate statistics to determine whether to perform early stopping of the ongoing evaluation.

The operations can further include performing transfer learning to obtain initial parameter values for the one or more adjustable parameters. Performing transfer learning can include identifying a plurality of previously studied machine-learned models. The plurality of previously studied machine-learned models can be organized in a sequence. Performing transfer learning can include building a plurality of Gaussian Process regressors respectively for the plurality of previously studied machine-learned models. The Gaussian Process regressor for each previously studied machine-learned model can be trained on one or more residuals relative to the Gaussian Process regressor for the previous previously studied machine-learned model in the sequence.

The operations can further include providing for display a parallel coordinates visualization of the one or more results and the one or more sets of parameter values for the one or more adjustable parameters.

Another aspect of the present disclosure is directed to a computer-implemented method to suggest parameter values for machine-learned models. The method includes receiving, by one or more computing devices, one or more results respectively associated with one or more sets of parameter values for one or more adjustable parameters of a machine-learned model. The result for each set of parameter values includes an evaluation of the machine-learned model constructed with such set of parameter values for the one or more adjustable parameters. The method includes generating, by the one or more computing devices, a suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the one or more results and the one or more sets of parameter values respectively associated with the one or more results. The method includes receiving, by the one or more computing devices, an adjustment to the suggested set of parameter values from a user. The adjustment includes at least one change to the suggested set of parameter values to form an adjusted set of parameter values. The method includes receiving, by the one or more computing devices, a new result associated with the adjusted set of parameter values. The method includes associating, by the one or more computing devices, the new result and the adjusted set of parameter values with the one or more results and the one or more sets of parameter values.

The one or more adjustable parameters of the machine-learned model can include one or more adjustable hyperparameters of the machine-learned model.

The method can further include generating, by the one or more computing devices, a second suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the new result for the adjusted set of parameter values.

Generating, by the one or more computing devices, the suggested set of parameter values can include performing, by the one or more computing devices, a first black box optimization technique to generate the suggested set of parameter values based at least in part on the one or more results and the one or more sets of parameter values. Generating, by the one or more computing devices, the second suggested set of parameter values can include performing, by the one or more computing devices, a second black box optimization technique to generate the second suggested set of parameter values based at least in part on the new result for the adjusted set of values. The second black box optimization technique can be different from the first black box optimization technique.

The method can further include, prior to performing, by the one or more computing devices, the second black box optimization technique to generate the second suggested set of parameter values, receiving, by the one or more computing devices, a user input that selects the second black box optimization technique from a plurality of available black box optimization techniques.

The method can further include, prior to performing, by the one or more computing devices, the second black box optimization technique to generate the second suggested set of parameter values, automatically selecting, by the one or more computing devices, the second black box optimization technique from a plurality of available black box optimization techniques.

Automatically selecting, by the one or more computing devices, the second black box optimization technique from the plurality of available black box optimization techniques can include automatically selecting, by the one or more computing devices, the second black box optimization technique from the plurality of available black box optimization techniques based at least in part on one or more of: a total number of results associated with the machine-learned model, a total number of adjustable parameters associated with the machine-learned model, and a user-defined setting indicative of a desired processing time.

Generating, by the one or more computing devices, the suggested set of parameter values based at least in part on the one or more results and the one or more sets of parameter values can include requesting, by the one or more computing devices via an internal abstract policy, generation of the suggested set of parameter values by an external custom policy provided by the user. Generating, by the one or more computing devices, the suggested set of parameter values based at least in part on the one or more results and the one or more sets of parameter values can include receiving, by the one or more computing devices, the suggested set of parameter values from the external custom policy provided by the user.

Another aspect of the present disclosure is directed to a computer-implemented method for black box optimization of parameters of a system, product, or process. The method includes performing, by one or more computing devices, one or more iterations of a sequence of operations. The sequence of operations includes determining, by the one or more computing devices, whether to sample an argument value from a feasible set of argument values using a first approach or using a second approach. Each argument value of the feasible set defines values for each of plural parameters of a system, product, or process. The sequence of operations includes, based on the determination, sampling, by the one or more computing devices, the argument value using the first approach or the second approach. The first approach includes sampling, by the one or more computing devices, the argument value at random from the feasible set and the second approach includes sampling, by the one or more computing devices, the argument value from a subset of the feasible set that is defined based on a ball around a current best argument value. The sequence of operations includes determining, by the one or more computing devices, whether a performance measure of the system, product, or process that has been determined using parameters defined by the sampled argument value is closer-to-optimal than a current closest-to-optimal performance measure. The sequence of operations includes, if the performance measure is closer-to-optimal than the current closest-to-optimal performance measure, updating, by the one or more computing devices, the current best argument value based on the sampled argument value. After completion of a final iteration of the sequence, the method includes outputting, by the one or more computing devices, the values of the parameters defined by the current best argument value for use in configuration of the system, formulation of the product or execution of the process.
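The iteration just described can be sketched end to end under simplifying assumptions. This sketch assumes a box-shaped feasible set with non-degenerate bounds, minimization, a probabilistic choice between the two approaches, radii drawn from a doubling series running from a resolution term up to at least the diameter of the box, and projection by clipping. The function name, parameter names, and default values are all hypothetical.

```python
import math
import random


def gradientless_descent(objective, bounds, iterations=200,
                         p_random=0.1, resolution=1e-3, seed=0):
    """Minimize `objective` over a box given as a list of (low, high) pairs."""
    rng = random.Random(seed)
    dim = len(bounds)

    def sample_feasible():
        # First approach: sample at random from the whole feasible set.
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def project(x):
        # Project onto the box by clipping each coordinate.
        return [min(hi, max(lo, v)) for v, (lo, hi) in zip(x, bounds)]

    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in bounds))
    # Geometric series of radii: resolution, 2*resolution, ... >= diameter.
    num_radii = max(1, math.ceil(math.log2(diameter / resolution)) + 1)
    radii = [resolution * 2 ** k for k in range(num_radii)]

    best_x = sample_feasible()
    best_f = objective(best_x)
    for _ in range(iterations):
        if rng.random() < p_random:
            candidate = sample_feasible()
        else:
            # Second approach: sample uniformly from a ball around the best point.
            radius = rng.choice(radii)
            direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(d * d for d in direction)) or 1.0
            step = radius * rng.random() ** (1.0 / dim)
            candidate = project(
                [b + step * d / norm for b, d in zip(best_x, direction)]
            )
        value = objective(candidate)
        if value < best_f:  # closer to optimal: update the current best
            best_x, best_f = candidate, value
    return best_x, best_f
```

For example, `gradientless_descent(lambda x: sum((v - 0.5) ** 2 for v in x), [(-2.0, 2.0)] * 2)` returns the best point found and its objective value. Only comparisons between objective values are used, so no gradient information is required.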

The ball can be localized around the current best argument value and can define a boundary of the subset of the feasible set from which sampling is performed in the second approach.

The ball can be defined by a radius that is selected at random from a geometric series of radii.

An upper limit on the geometric series of radii can be dependent on a diameter of a dataset, a resolution of the dataset and a dimensionality of an objective function.

The determination whether to sample the argument value from the feasible set of argument values using the first approach or using the second approach can be probabilistic.

Sampling the argument value using the second approach can include determining, by the one or more computing devices, the argument value from the subset of the feasible set that is bounded by the ball that is localized around the current best argument value. Sampling the argument value using the second approach can include projecting, by the one or more computing devices, the determined argument value onto the feasible set of argument values, thereby to obtain the sampled argument value.

Another aspect of the present disclosure is directed to a computer system operable to perform black box optimization. The computer system includes one or more processors and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations. The operations include identifying a best observed set of values for one or more adjustable parameters. The operations include determining a radius. The operations include generating a ball that has the radius around the best observed set of values for the one or more adjustable parameters. The operations include determining a random sample from within the ball. The operations include determining a suggested set of values for the one or more adjustable parameters based at least in part on the random sample from within the ball.

Determining the radius can include randomly sampling the radius from within a geometric series.

Determining the radius can include determining the radius based at leastin part on a user-defined resolution term.

Determining the radius can include randomly sampling the radius from adistribution of available radii that has a minimum equal to auser-defined resolution term.

Determining the radius can include randomly sampling the radius from adistribution of available radii that has a maximum that is based atleast in part on a diameter of a feasible set of values for the one ormore adjustable parameters.
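As one non-limiting illustration, the radius selection described above might be sketched as follows. The function names, the common ratio of 2, and the choice to cap the series exactly at the diameter are assumptions made for this example only; the disclosure requires only that the minimum equal the resolution term and that the maximum be based at least in part on the diameter of the feasible set.

```python
import random


def geometric_radii(resolution, diameter, ratio=2.0):
    """Build the series resolution, resolution*ratio, ... not exceeding diameter.

    `ratio` is a hypothetical tuning constant, not taken from the disclosure.
    """
    if resolution <= 0 or diameter < resolution or ratio <= 1.0:
        raise ValueError("need 0 < resolution <= diameter and ratio > 1")
    radii = []
    radius = resolution
    while radius <= diameter:
        radii.append(radius)
        radius *= ratio
    return radii


def sample_radius(resolution, diameter, rng=random):
    """Pick one radius uniformly at random from the geometric series."""
    return rng.choice(geometric_radii(resolution, diameter))
```

Because the series is geometric, the number of candidate radii grows only logarithmically with the ratio of diameter to resolution, so small and large steps are both tried regularly.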

Determining the suggested set of values for the one or more adjustable parameters based at least in part on the random sample from within the ball can include selecting, as the suggested set of values, a projection of the random sample from within the ball onto a feasible set of values for the one or more adjustable parameters.

The operations can further include receiving a result obtained through evaluation of the suggested set of values. The operations can further include comparing the result to a best observed result obtained through evaluation of the best observed set of values to determine whether to update the best observed set of values to equal the suggested set of values.

The operations can further include determining, according to a user-defined probability, whether to select a random sample from a feasible set of values for the one or more adjustable parameters as the suggested set of values rather than determine the suggested set of values based at least in part on the random sample from within the ball.

Another aspect of the present disclosure is directed to a computer-implemented method to perform black box optimization. The method includes performing, by one or more computing devices, a plurality of suggestion rounds to respectively suggest a plurality of suggested sets of values for one or more adjustable parameters. Performing each suggestion round includes determining, by the one or more computing devices, whether to perform a random sampling technique or a ball sampling technique. Performing each suggestion round includes, when it is determined to perform the random sampling technique: determining, by the one or more computing devices, a random sample from a feasible set of values for the one or more adjustable parameters; and selecting, by the one or more computing devices, the random sample as the suggested set of values for the one or more adjustable parameters for the current suggestion round. Performing each suggestion round includes, when it is determined to perform the ball sampling technique: determining, by the one or more computing devices, a radius; generating, by the one or more computing devices, a ball that has the radius around a best observed set of values for the one or more adjustable parameters; determining, by the one or more computing devices, a random sample from within the ball; and determining, by the one or more computing devices, the suggested set of values for the current suggestion round based at least in part on the random sample from within the ball.

Determining, by the one or more computing devices, the radius can include randomly sampling, by the one or more computing devices, the radius from within a geometric series.

Determining, by the one or more computing devices, the radius can include determining, by the one or more computing devices, the radius based at least in part on a user-defined resolution term.

Determining, by the one or more computing devices, the radius can include randomly sampling, by the one or more computing devices, the radius from a distribution of available radii that has a minimum equal to a user-defined resolution term.

Determining, by the one or more computing devices, the radius can include randomly sampling, by the one or more computing devices, the radius from a distribution of available radii that has a maximum that is based at least in part on a diameter of a feasible set of values for the one or more adjustable parameters.

Determining, by the one or more computing devices, the suggested set of values for the one or more adjustable parameters based at least in part on the random sample from within the ball can include selecting, by the one or more computing devices as the suggested set of values, a projection of the random sample from within the ball onto a feasible set of values for the one or more adjustable parameters.

Performing each suggestion round can further include receiving, by the one or more computing devices, a result obtained through evaluation of the suggested set of values. Performing each suggestion round can further include comparing the result to a best observed result obtained through evaluation of the best observed set of values to determine whether to update the best observed set of values to equal the suggested set of values.

Determining, by the one or more computing devices, whether to perform the random sampling technique or the ball sampling technique can include determining, by the one or more computing devices, whether to perform the random sampling technique or the ball sampling technique according to a predefined probability.

Determining, by the one or more computing devices, whether to perform the random sampling technique or the ball sampling technique can include determining, by the one or more computing devices, whether to perform the random sampling technique or the ball sampling technique according to a user-defined probability.
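By way of example only, the suggestion rounds described above might be sketched as follows. Several choices here are assumptions for illustration and are not specified by the foregoing aspects: the feasible set is taken to be an axis-aligned box (so that projection reduces to clamping), the sample is drawn uniformly from a Euclidean ball, the radii double from the resolution term up to the box diameter, and the objective is minimized. All function and parameter names are hypothetical.

```python
import math
import random


def sample_in_ball(center, radius, rng):
    """Draw a point uniformly from the Euclidean ball around `center`."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(v * v for v in direction)) or 1.0
    # u ** (1/d) makes the distance from the center uniform in volume.
    scale = radius * rng.random() ** (1.0 / len(center)) / norm
    return [c + scale * v for c, v in zip(center, direction)]


def project_to_box(point, bounds):
    """Project onto the box feasible set by clamping each coordinate."""
    return [min(max(v, lo), hi) for v, (lo, hi) in zip(point, bounds)]


def suggest(best, bounds, resolution, p_random, rng):
    """One suggestion round: random sampling or ball sampling."""
    if best is None or rng.random() < p_random:
        # Random sampling technique: uniform over the feasible set.
        return [rng.uniform(lo, hi) for lo, hi in bounds]
    # Ball sampling technique: radius from a geometric series.
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in bounds))
    radii = [resolution]
    while radii[-1] * 2.0 <= diameter:
        radii.append(radii[-1] * 2.0)
    radius = rng.choice(radii)
    return project_to_box(sample_in_ball(best, radius, rng), bounds)


def minimize(objective, bounds, rounds=200, resolution=1e-3,
             p_random=0.1, seed=0):
    """Run suggestion rounds, keeping the best observed set of values."""
    rng = random.Random(seed)
    best, best_value = None, math.inf
    for _ in range(rounds):
        candidate = suggest(best, bounds, resolution, p_random, rng)
        value = objective(candidate)
        if value < best_value:  # update only on improvement
            best, best_value = candidate, value
    return best, best_value
```

A caller might invoke `minimize(lambda x: sum(v * v for v in x), [(-5.0, 5.0)] * 2)` to tune two parameters; only objective values are used, never gradients.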

Other aspects of the present disclosure are directed to various systems, apparatuses, non-transitory computer-readable media, user interfaces, and electronic devices.

These and other features, aspects, and advantages of various embodiments of the present disclosure will become better understood with reference to the following description and appended claims. The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate example embodiments of the present disclosure and, together with the description, serve to explain the related principles.

BRIEF DESCRIPTION OF THE DRAWINGS

Detailed discussion of embodiments directed to one of ordinary skill in the art is set forth in the specification, which makes reference to the appended figures, in which:

FIG. 1 depicts a block diagram of an example computing system architecture according to example embodiments of the present disclosure.

FIG. 2 depicts a block diagram of an example computing system architecture according to example embodiments of the present disclosure.

FIG. 3 depicts a graphical diagram of an example dashboard user interface according to example embodiments of the present disclosure.

FIG. 4 depicts a graphical diagram of an example parallel coordinates visualization according to example embodiments of the present disclosure.

FIG. 5 depicts a graphical diagram of an example transfer learning scheme according to example embodiments of the present disclosure.

FIG. 6 depicts graphical diagrams of example experimental results according to example embodiments of the present disclosure.

FIG. 7 depicts a graphical diagram of example experimental results according to example embodiments of the present disclosure.

FIG. 8 depicts graphical diagrams of example experimental results according to example embodiments of the present disclosure.

FIG. 9 depicts an example illustration of β-balancedness for two functions according to example embodiments of the present disclosure.

FIG. 10 depicts an example illustration of a ball sampling analysis according to example embodiments of the present disclosure.

FIG. 11 depicts an example illustration of a ball sampling analysis according to example embodiments of the present disclosure.

FIG. 12 depicts a block diagram of an example computing system according to example embodiments of the present disclosure.

FIG. 13 depicts a flow chart diagram of an example method to perform black-box optimization according to example embodiments of the present disclosure.

FIG. 14 depicts a flow chart diagram of an example method to perform a ball sampling technique according to example embodiments of the present disclosure.

DETAILED DESCRIPTION

Black box optimization can be used to find the best operating parameters for any system, product, or process whose performance can be measured or evaluated as a function of those parameters. It has many important applications. For instance, it may be used in the optimization of physical systems and products, such as the optimization of the configuration of airfoils (e.g., optimizing airfoil shapes based on computer simulations of flight performance) or the optimization of the formulation of alloys or metamaterials. Other uses include the optimization (or tuning) of hyperparameters of machine learning systems, such as learning rates or the number of hidden layers in a deep neural network.

Two important considerations with respect to black-box optimization are the performance of the optimization and the expenditure of resources (e.g., computational resources) required to perform the optimization. With regard to resources, these may be expended as a result of the function evaluations (that is, the evaluation of the performance of a particular variant of the system, product, or process as defined by a particular set of parameters) or as a result of the execution of the optimization algorithm for determining a next set of parameters for use in the next performance evaluation.

Described herein are computing systems and associated methods which may serve to reduce the expenditure of resources when performing optimization of the parameters of a system, product, or process. Various aspects (including those relating to early-stopping and the ability to override system-suggested parameters) may serve to reduce resource expenditure resulting from function evaluation, while others (for instance those relating to the “Gradientless Descent” optimization algorithm provided by the present disclosure) may serve to reduce computational resource expenditure resulting from execution of the optimization algorithm.

More generally, the present disclosure is directed to computing systems and associated methods for optimizing one or more adjustable parameters (e.g., operating parameters) of a system. In particular, the present disclosure provides a parameter optimization system that can perform one or more black-box optimization techniques to iteratively suggest new sets of parameter values for evaluation. The system can interface with a user device to receive results obtained through the evaluation of the suggested parameter values by the user. Alternatively or additionally, the parameter optimization system can provide an evaluation service that evaluates the suggested parameter values using one or more evaluation devices. Through the use of black-box optimization techniques, the system can iteratively suggest new sets of parameter values based on the returned results. The iterative suggestion and evaluation process can serve to optimize or otherwise improve the overall performance of the system, as evaluated by an objective function that evaluates one or more metrics.

In some implementations, the parameter optimization system of the present disclosure may utilize a novel parameter optimization technique provided herein which is referred to as “Gradientless Descent.” Gradientless Descent, which is discussed in more detail below, provides a mix between the benefits of truly random sampling and random sampling near a best observed set of parameter values to date. Gradientless Descent also converges exponentially fast under relatively weak conditions and is highly effective in practice. By converging fast, it is possible to reach an acceptable degree of optimization in fewer iterations, thereby reducing the total computation associated with the optimization. Moreover, because Gradientless Descent is a relatively simple algorithm, the computational resources required to execute the algorithm are low, particularly when compared with alternative, more complex optimization approaches such as Bayesian Optimization. In addition, as is explained in more detail below, under certain conditions, the performance of Gradientless Descent may dominate that of Bayesian Optimization, despite its simplicity. As such, Gradientless Descent may provide both improved optimization and reduced computational resource expenditure, when compared with alternative approaches such as Bayesian Optimization.

The parameter optimization system of the present disclosure can be employed to simultaneously optimize or otherwise improve adjustable parameters associated with any number of different systems including, for example, one or more different models, products, processes, and/or other systems. In particular, in some implementations, the parameter optimization system can include or provide a service that allows users to create and run “studies” or “optimization procedures”. A study or optimization procedure can include a specification of a set of adjustable parameters that affect the quality, performance, or outcome of a system. A study can also include a number of trials, where each trial includes a defined set of values for the adjustable parameters together with the results of conducting the trial (once available). The results of a trial can include any relevant metric that describes the quality, performance, or outcome of the system (e.g., in the form of the objective function) that results from use of the set of values defined for such trial. Put another way, each trial may correspond to a particular variant of the model, product, process, or system as defined by the set of values for the adjustable parameters. The results of the trial may include a performance evaluation (or a performance measure) of the variant to which the trial relates.
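For purposes of illustration only, the relationship between a study and its trials might be represented with data structures such as the following. The class names, field names, and the use of numeric ranges for the feasible set are hypothetical and are not drawn from any particular implementation of the system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Trial:
    """One set of parameter values plus its result, once available."""
    parameters: Dict[str, float]
    result: Optional[float] = None  # None until the trial is evaluated

    @property
    def completed(self) -> bool:
        return self.result is not None


@dataclass
class Study:
    """Adjustable parameters (with feasible ranges) and their trials."""
    feasible: Dict[str, Tuple[float, float]]
    trials: List[Trial] = field(default_factory=list)

    def add_trial(self, parameters: Dict[str, float]) -> Trial:
        # Reject values outside the feasible range of any parameter.
        for name, value in parameters.items():
            lo, hi = self.feasible[name]
            if not lo <= value <= hi:
                raise ValueError(f"{name}={value} is outside [{lo}, {hi}]")
        trial = Trial(dict(parameters))
        self.trials.append(trial)
        return trial

    def best_trial(self, maximize: bool = True) -> Optional[Trial]:
        # Only completed trials have a result to compare.
        done = [t for t in self.trials if t.completed]
        if not done:
            return None
        pick = max if maximize else min
        return pick(done, key=lambda t: t.result)
```

In this sketch a trial is recorded before its result exists, which mirrors the statement above that results are attached to a trial once available.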

In one particular example application, the parameter optimization system can be employed to optimize the parameters of a machine-learned model such as, for example, a deep neural network. For example, the adjustable parameters of the model can include hyperparameters such as, for example, learning rate, number of layers, number of nodes in each layer, etc. Through the use of black-box optimization technique(s), the parameter optimization system can iteratively suggest new sets of values for the model parameters to improve the performance of the model. For example, the performance of the model can be measured according to different metrics such as, for example, the accuracy of the model (e.g., on a validation data set or testing data set).

In another example application, the parameter optimization system can be employed to optimize the adjustable parameters (e.g., component or ingredient type or amount, production order, production timing) of a physical product or process of producing a physical product such as, for example, an alloy, a metamaterial, a concrete mix, a process for pouring concrete, a drug cocktail, or a process for performing therapeutic treatment. Additional example applications include optimization of the user interfaces of web services (e.g., optimizing colors and fonts to maximize reading speed) and optimization of physical systems (e.g., optimizing airfoils in simulation).

As another example, in some instances, an experiment such as a scientific experiment with a number of adjustable parameters can be viewed as a system or process to be optimized.

More generally, the parameter optimization system and associated techniques provided herein can be applied to a wide variety of products, including any system, product, or process that can be specified by, for example, a set of components and/or operating/processing parameters. Thus, in some implementations, the parameter optimization system can be used to perform optimization of products (e.g., personalized products) via automated experimental design.

According to an aspect of the present disclosure, the parameter optimization system can perform a black-box optimization technique to suggest a new set of parameter values for evaluation based on the previously evaluated sets of values and their corresponding results associated with a particular study. The parameter optimization system of the present disclosure can use any number of different types of black-box optimization techniques, including the aforementioned novel optimization technique provided herein which is referred to as “Gradientless Descent.”

Black-box optimization techniques make minimal assumptions about the problem under consideration, and thus are broadly applicable across many domains. Black-box optimization has been studied in multiple scholarly fields under names including Bayesian Optimization (see, e.g., Bergstra et al. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems. 2546-2554; Shahriari et al. 2016. Taking the human out of the loop: A review of Bayesian optimization. Proc. IEEE 104, 1 (2016), 148-175; and Snoek et al. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems. 2951-2959); Derivative-free optimization (see, e.g., Conn et al. 2009. Introduction to derivative-free optimization. SIAM; and Rios and Sahinidis. 2013. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization 56, 3 (2013), 1247-1293); Sequential Experimental Design (see, e.g., Chernoff. 1959. Sequential Design of Experiments. Ann. Math. Statist. 30, 3 (09 1959), 755-770); and assorted variants of the multiarmed bandit problem (see, e.g., Ginebra and Clayton. 1995. Response Surface Bandits. Journal of the Royal Statistical Society. Series B (Methodological) 57, 4 (1995), 771-784; Li et al. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. CoRR abs/1603.06560 (2016); and Srinivas et al. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML (2010)).

Several classes of algorithms are included under the umbrella of black-box optimization techniques. The simplest of these are non-adaptive procedures such as Random Search, which selects x_(t) uniformly at random from X at each time step t independent of the previous points selected, x_(τ): 1≤τ<t, and Grid Search, which selects along a grid (e.g., the Cartesian product of finite sets of feasible values for each parameter). Classic algorithms such as Simulated Annealing and assorted genetic algorithms have also been investigated, including, for example, Covariance Matrix Adaptation (Hansen et al., 1996, Adapting Arbitrary Normal Mutation Distributions in Evolution Strategies: The Covariance Matrix Adaptation, Proc. IEEE (ICEC '96), 312-317).
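As a brief illustration of the two non-adaptive procedures just mentioned, a minimal sketch is given below. It assumes, for the purposes of the example only, that the feasible set X is a box of numeric ranges for Random Search and a list of finite value sets for Grid Search; the function names are hypothetical.

```python
import itertools
import random


def random_search(bounds, num_points, seed=0):
    """Draw each point uniformly from the box, ignoring earlier points."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
            for _ in range(num_points)]


def grid_search(values_per_parameter):
    """Enumerate the Cartesian product of per-parameter value sets."""
    return list(itertools.product(*values_per_parameter))
```

For instance, `grid_search([[0.01, 0.1], [1, 2, 3]])` enumerates six points, which shows how the grid grows multiplicatively with each added parameter.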

Another class of black-box optimization algorithms performs a local search by selecting points that maintain a search pattern, such as a simplex in the case of the classic Nelder-Mead algorithm (Nelder and Mead. 1965. A simplex method for function minimization. The Computer Journal 7, 4 (1965), 308-313). More modern variants of these algorithms maintain simple models of the objective ƒ within a subset of the feasible regions (called the trust region), and select a point x_(t) to improve the model within the trust region (see, e.g., Conn et al. 2009. Introduction to derivative-free optimization. SIAM).

More recently, some researchers have combined powerful techniques for modeling the objective ƒ over the entire feasible region, using ideas developed for multiarmed bandit problems for managing explore/exploit trade-offs. These approaches are fundamentally Bayesian in nature, hence this literature goes under the name Bayesian Optimization. Typically, the model for ƒ is a Gaussian process (e.g., as in Snoek et al. 2012. Practical Bayesian optimization of machine learning algorithms. Advances in neural information processing systems, 2951-2959; and Srinivas et al. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. ICML (2010)); a deep neural network (e.g., as in Snoek et al. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 Jul. 2015, Vol. 37, 2171-2180; and Wilson et al. 2016. Deep kernel learning. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 370-378); or a regression forest (e.g., as in Bergstra et al. 2011. Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 2546-2554; and Hutter et al. 2011. Sequential model-based optimization for general algorithm configuration. International Conference of Learning and Intelligent Optimization, Springer, 507-523). The parameter optimization system of the present disclosure can perform or support performance of any of the optimization techniques described above in addition to other black box optimization techniques not specifically identified.

Many of these black-box optimization algorithms have open-source implementations available. Within the machine learning community, open-source examples include HyperOpt, Metric Optimization Engine (MOE), Spearmint, and AutoWeka, among others.

In contrast to such software packages, which require practitioners to set up and run them locally, the system of the present disclosure provides a managed service for black-box optimization, which is more convenient for users but also involves additional design considerations. In particular, the parameter optimization system of the present disclosure can include a unique architecture which features a convenient Remote Procedure Call (RPC) and can support a number of advanced features such as transfer learning, automated early stopping, dashboard and analysis tools, and others, as will be described in further detail below.

As one example of such advanced features, according to an aspect of the present disclosure, the parameter optimization system can enable or perform dynamic switching between optimization algorithms during optimization of a set of system parameters. For example, the system can dynamically change black-box optimization techniques between at least two of the plurality of rounds of generation of suggested trials, including while other trials are ongoing.

In particular, in some implementations, some or all of the supported black-box optimization techniques can be stateless in nature so as to enable such dynamic switching. For example, in some implementations, the optimization algorithms supported by the parameter optimization system can be computed from or performed relative to the data stored in the system database, and nothing else, where all state is stored in the database. Such a configuration provides a major operational advantage: the state of the database can be changed (e.g., changed arbitrarily) and then processes, algorithms, metrics, or other methods can be performed “from scratch” (e.g., without relying on previous iterations of the processes, algorithms, metrics, or other methods).

In some implementations, the switch between optimization algorithms can be automatically performed by the parameter optimization system. For example, the parameter optimization system can automatically switch between two or more different black box optimization techniques based on one or more factors, including, for example: a total number of trials associated with the study; a total number of adjustable parameters associated with the study; and a user-defined setting indicative of a desired processing time. As an example, a first black-box optimization technique may be superior when the number of previous trials to consider is low, but may become undesirably computationally expensive when the number of trials reaches a certain number, while a second black-box optimization technique may be superior (e.g., because it is less computationally expensive) when the number of previous trials to consider is very high. Thus, in one example, when the total number of trials associated with the study reaches a threshold amount, the parameter optimization system can automatically switch from use of the first technique to use of the second technique. More generally, the parameter optimization system can continuously or periodically consider which of a plurality of available black-box optimization techniques is best suited for performance of the next round of suggestion, given the current status of the study (e.g., number of trials, number of parameters, shape of data and previous trials, feasible parameter space) and any other information including user-provided guidance about processing time/expenditure or other tradeoffs. Thus, a partnership between a human user and the parameter optimization system can guide selection of the appropriate black-box optimization technique at each instance of suggestion.
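One hypothetical way to combine stateless techniques with threshold-based switching is sketched below. The two techniques shown are simple stand-ins chosen for brevity (a history-based perturbation of the best trial and a history-free random sample); the threshold value, the dictionary layout of a stored trial, and all function names are assumptions for this example, and a deployed system could weigh additional factors such as the number of parameters or a desired processing time.

```python
import random


def suggest_random(trials, bounds, rng):
    """Cheap technique: ignores the stored trials entirely."""
    return [rng.uniform(lo, hi) for lo, hi in bounds]


def suggest_near_best(trials, bounds, rng):
    """Stand-in for a costlier technique that reads all completed trials."""
    done = [t for t in trials if t["result"] is not None]
    if not done:
        return suggest_random(trials, bounds, rng)
    best = max(done, key=lambda t: t["result"])  # assumes maximization
    return [min(max(v + rng.gauss(0.0, 0.1 * (hi - lo)), lo), hi)
            for v, (lo, hi) in zip(best["parameters"], bounds)]


def choose_technique(trials, trial_threshold=1000):
    """Switch to the cheaper technique once the study grows large."""
    if len(trials) < trial_threshold:
        return suggest_near_best
    return suggest_random


def next_suggestion(database, bounds, seed=None):
    """Stateless round: everything is recomputed from the stored trials."""
    rng = random.Random(seed)
    technique = choose_technique(database)
    return technique(database, bounds, rng)
```

Because each call to `next_suggestion` depends only on the trial records passed in, the technique can change from one round to the next without any hand-off of internal algorithm state.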

In addition or alternatively to automatic switching between optimization algorithms, the parameter optimization system can support manual switching between optimization algorithms. Thus, a user of the system can manually specify which of a number of available techniques should be used for a given round of suggestion.

According to another aspect of the present disclosure, the parameter optimization system can provide the ability to override a suggested trial provided by the system with changes to the suggested trial. That is, the parameter optimization system can provide a suggested set of values for the adjustable parameters of the study, and then receive and accept an adjustment to the suggested trial from a user, where the adjustment includes at least one change to the suggested set of values to form an adjusted set of values. The user can provide a result obtained through evaluation of the adjusted set of values and the new result and the adjusted set of values can be associated with the study as a completed trial.

Providing the ability to adjust a suggested trial enables a user to modify the suggested trial when, for any reason, the user is aware that the suggested trial will not provide a positive result or is otherwise infeasible or impractical to evaluate. For example, based on experience the user may be aware that the suggested trial will not provide a positive result. The user can adjust the suggested trial to provide an adjusted trial that is more likely to provide an improved result. The ability to adjust a suggested trial can save time and computation expense as suggested trials that are known ex ante to correspond to poor results are not required to be evaluated and, in fact, can be replaced with more useful adjusted trials. As another example benefit, suggested trials that would require substantial time or expenditure of computational resources to evaluate (e.g., due to the particular parameter value provided by the suggested trial) are not required to be evaluated and, in fact, can be replaced with adjusted trials that are less computationally expensive to evaluate. Thus, again, the parameter optimization system can enable and leverage a partnership between a human user and the parameter optimization system to improve computational resource expenditure, time or other attributes of the suggestion/evaluation process.

In some studies, it may be possible to know (e.g., after the experiment is started or completed) the parameter values that were used in a Trial, yet it may not be practical to precisely control said parameters. One example happens in mixed-initiative systems, where the parameter optimization system would suggest experiments (e.g., recipes) to a human (e.g., human chef), and the human has the right to modify the experiment (e.g., recipe) so long as he/she reports what was actually evaluated (e.g., cooked).

According to another aspect of the present disclosure, the parameter optimization system can provide the ability to change a feasible set of parameter values for one or more of the adjustable parameters while a study is pending. Thus, should new information come to light or new judgments be made about the feasible set of values for a particular parameter, the parameter optimization system can support changes to the feasible set of values by a user, while a study is pending.

According to yet another aspect of the present disclosure, the parameter optimization system can provide the ability to ask for additional suggestions at any time and/or report back results at any time. Thus, in some implementations, the parameter optimization system can support parallelization and/or be designed asynchronously.

In some implementations, the parameter optimization system can perform batching of requests for and provision of suggestions. Thus, the system can batch at least a portion of a plurality of requests for additional suggested trials and, in response, generate the additional suggested trials as a batch. For example, fifty computing devices can collectively make a single request for fifty suggestions which can be generated in one batch.

More particularly, in some instances it may be desired for the system to suggest multiple trials to run in parallel. The multiple trials should collectively contain a diverse set of parameter values that are believed to provide “good” results. Performing such batch suggestion requires the parameter optimization system to have some additional algorithmic sophistication. That is, instead of simply picking the “best” single suggestion (e.g., as provided by a particular black-box optimization technique based on currently available results), the parameter optimization system can provide multiple suggestions that do not contain duplicates or that are otherwise intelligently selected relative to each other. For example, in some implementations, suggested trials can be conditioned on pending trials or other trials that are to be suggested within the same batch.

As one example, in some implementations, the parameter optimization system can hallucinate or synthesize poor results for pending trials or other trials that are to be suggested within the same batch, thereby guiding the black-box optimization technique away from providing a duplicate suggestion. In some implementations, the “hallucinated” results are temporary and transient. That is, each hallucinated value may last only from the moment a Trial is suggested to the moment the evaluation is complete. Thus, in some implementations, the hallucinations can exist solely to reserve some space, and to prevent another, very similar Trial from being suggested nearby, until the first one is complete.
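The transient nature of the hallucinated results might be sketched as follows. Using the worst completed result as the “poor” placeholder value is an assumption made for this example, as are the function name and the dictionary layout of a trial; the disclosure does not prescribe how the poor result is synthesized.

```python
def with_hallucinations(trials, maximize=True):
    """Return a view of the trials in which pending ones look poor.

    The stored trials are never modified, so the placeholder results
    disappear as soon as the real evaluation is recorded.
    """
    completed = [t["result"] for t in trials if t["result"] is not None]
    if not completed:
        # Nothing observed yet, so no basis for a placeholder value.
        return [dict(t) for t in trials]
    poor = min(completed) if maximize else max(completed)
    view = []
    for trial in trials:
        if trial["result"] is None:
            view.append(dict(trial, result=poor, hallucinated=True))
        else:
            view.append(dict(trial))
    return view
```

A black-box optimization technique that is given this view instead of the raw records sees a poor outcome at each pending point and is therefore steered away from suggesting a near-duplicate within the same batch.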

In addition, in some implementations, the multiple suggestions provided by the parameter optimization system can lead to more specific and precise evaluation of a particular adjustable parameter. For example, in some implementations, some or all but one of the adjustable parameters can be constrained (e.g., held constant or held within a defined sub-range) while multiple suggested values are provided for the non-constrained parameter(s). In such fashion, trials can be suggested that help to identify data around a particular parameter or a particular relationship between two or more parameters. The selection of constrained versus non-constrained parameters can be user-guided or automatically selected based on the results of previous trials.

According to another aspect of the present disclosure, the parameter optimization system can perform or support early stopping of pending trials. For example, the system can implement or otherwise support use of one or more automated stopping algorithms that evaluate the intermediate statistics (e.g., initial results) of a pending trial to determine whether to perform early stopping of the trial, thereby saving resources that would otherwise be consumed by completing a trial that is not likely to provide a positive result. As one example, the system can implement or otherwise support use of a performance curve stopping rule that performs regression on a performance curve to make a prediction of the final result (e.g., objective function value) of a trial. In particular, while certain existing early stopping techniques use parametric regression, the performance curve stopping rule provided by the present disclosure is unique in that it uses non-parametric regression.

Put in other terms, the parameter optimization system can provide theability to receive one or more intermediate evaluations of performanceof the suggested variant (or trial), the intermediate evaluations havingbeen obtained from an ongoing evaluation of the suggested variant. Basedon the intermediate evaluations and prior evaluations in respect ofprior variants (or trials), non-parametric regression may be performedin order to determine whether to perform early-stopping of the ongoingevaluation of the suggested variant. In response to determining thatearly-stopping is to be performed, early-stopping of the ongoingevaluation may be caused or an indication that early-stopping should beperformed may be provided.

As already mentioned, the ability of the system to perform earlystopping may reduce the expenditure of computational resources that areassociated with continuing the performance of on-going variantevaluations which are determined to be unlikely to ultimately yield afinal performance evaluation that is in excess of a current-bestperformance evaluation. Indeed, the non-parametric early stoppingdescribed herein has been found to achieve optimality gaps, when tuninghyper-parameters for deep neural networks, that are comparable to thoseachieved without making use of early stopping, while using approximately50% fewer CPU hours.

More specifically, performing non-parametric regression to determinewhether to perform early-stopping of the ongoing evaluation of thesuggested variant can include determining, based on the non-parametricregression, a probability of a final performance of the suggestedvariant exceeding a current best performance as indicated by one of theprior evaluations of performance of a prior variant. In someimplementations, the determination as to whether to performearly-stopping may then be performed based on a comparison of thedetermined probability with a threshold.

Performance of non-parametric regression to determine whether to performearly-stopping of the ongoing evaluation of the suggested variant caninclude measuring a similarity between a performance curve that is basedon the intermediate evaluations and a performance curve corresponding toperformance of a current best variant that is based on the priorevaluation for the current best variant.
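For illustration only, the probability-and-threshold decision described above can be sketched as follows. This is a minimal sketch and not the disclosed stopping rule: it assumes a maximization objective, a Gaussian kernel over the distance between curve prefixes, and an extrapolation that adds each prior curve's remaining gain to the pending trial's latest value. The function names, the bandwidth, and the threshold are all hypothetical.

```python
import math


def probability_of_beating_best(partial, prior_curves, bandwidth=0.1):
    """Kernel-weighted estimate of P(final value of the pending trial > best final)."""
    t = len(partial)
    best_final = max(curve[-1] for curve in prior_curves)
    total_weight = 0.0
    winning_weight = 0.0
    for curve in prior_curves:
        if len(curve) < t:
            continue  # prior curve is too short to compare against
        # Similarity: mean squared distance between the two curve prefixes.
        msd = sum((a - b) ** 2 for a, b in zip(partial, curve)) / t
        weight = math.exp(-msd / (2.0 * bandwidth ** 2))
        # Assumed extrapolation: reuse the gain this prior curve made after step t.
        predicted_final = partial[-1] + (curve[-1] - curve[t - 1])
        total_weight += weight
        if predicted_final > best_final:
            winning_weight += weight
    # With no comparable prior curve, do not recommend stopping.
    return winning_weight / total_weight if total_weight > 0.0 else 1.0


def should_stop_early(partial, prior_curves, threshold=0.05):
    """Recommend stopping when the estimated probability falls below the threshold."""
    return probability_of_beating_best(partial, prior_curves) < threshold
```

In this sketch, a pending trial whose intermediate values lag well behind every prior curve receives a probability near zero and is recommended for stopping, while a trial tracking above the current best curve is allowed to continue.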

According to yet another aspect of the present disclosure, the parameter optimization system can perform or support transfer learning between studies. In particular, the parameter optimization system of the present disclosure can support a form of transfer learning that allows users to leverage data from prior studies to guide and accelerate their current study. As an example, in some implementations, the system can employ a novel transfer learning process that includes building a plurality of Gaussian Process regressors respectively for a plurality of previously conducted studies that are organized into a sequence (e.g., a temporal sequence). In particular, the Gaussian Process regressor for each study can be trained on one or more residuals relative to the Gaussian Process regressor for the previous study in the sequence. This novel transfer learning technique ensures a certain degree of robustness since badly chosen priors will not harm the prediction asymptotically. The transfer learning capabilities described herein can be particularly valuable when the number of trials per study is relatively small, but there are many such studies.

According to another aspect of the present disclosure, the parameteroptimization system can provide a mechanism, referred to herein as an“algorithm playground,” for advanced users to easily, quickly, andsafely replace the core optimization algorithms supported by the systemwith arbitrary algorithms supported by the user. The algorithmplayground allows users to inject trials into a study. Morespecifically, the algorithm playground can include an internal abstractpolicy that interfaces with a custom policy provided by a user. Thus, insome implementations, the parameter optimization system can request, viaan internal abstract policy, generation of the suggested trial by theexternal custom policy provided by the user. The parameter optimizationsystem can then receive a suggested trial from the external custompolicy, thereby allowing a user to employ any arbitrary custom policy toprovide suggested trials which will be incorporated in the study.

According to yet another aspect of the present disclosure, the parameter optimization system can include a dashboard and analysis tools. The web-based dashboard can be used for monitoring and/or changing the state of studies. The dashboard can be fully featured and implement the full functionality of a system API. The dashboard can be used for tracking the progress of the study; interactive visualizations; creating, updating, and/or deleting a study; requesting new suggestions, early stopping, activating/deactivating a study; or other actions or interactions. As one example, the interactive visualizations accessible via the dashboard can include a parallel coordinates visualization that visualizes the one or more results relative to the respective values for each parameter dimension that are associated with the completed trials.

The parameter optimization system of the present disclosure also has thebenefit of enabling post-facto tuning of black-box optimizationalgorithms. In particular, in the event that users provide consent foruse of their study data, data from a significant number of studies canbe used to tune different optimization techniques or otherwise evaluatethe outcomes from use of such different optimization techniques, therebyenabling a post-hoc evaluation of algorithm performance.

According to another aspect of the present disclosure, in some exampleapplications the parameter optimization system can be employed to notonly generally optimize a system such as a product or process, but canbe used to optimize the system relative to a particular application orparticular subset of individuals. As an example, a study can beperformed where the results are limited to feedback from or relative toa particular scenario, application, or subset of individuals, therebyspecifically optimizing the system for such particular scenario,application, or subset of individuals.

To provide an example, as described above, the parameter optimization system can be used to generally optimize the adjustable parameters of a process of pouring concrete (e.g., ingredient type or volume, ordering, timing, operating temperatures, etc.). In addition, by limiting the trials and/or the evaluation thereof to a particular scenario (e.g., ambient temperature conditions between 60 degrees and 65 degrees Fahrenheit; elevation between 1250 feet and 1350 feet; surrounding soil conditions of a certain type; etc.) the adjustable parameters of the concrete pouring process can be optimized relative to such particular scenario. To provide another example, the adjustable parameters of a user interface (e.g., font, color, etc.) can be optimized relative to a specific subset of users (e.g., engineers that live in Pittsburgh, Pa.). Thus, in some implementations, the parameter optimization system can be used to perform personalized or otherwise specialized optimization of systems such as products or processes.

As already mentioned, the present disclosure provides a novel black-boxoptimization technique which is referred to herein as “GradientlessDescent.”

In particular, in some implementations, the Gradientless Descenttechnique can be employed (e.g., by the parameter optimization system)in an iterative process that includes a plurality of rounds ofsuggestion and evaluation. More particularly, each suggestion round canresult in a suggested set of parameter values (e.g., a suggestedtrial/variant), which may be defined by a sampled “argument value”.Thus, a single iteration of Gradientless Descent can be performed toobtain a new suggestion (e.g., suggested variant/trial). However, aswith most black-box optimization techniques, multiple iterations ofsuggestion and evaluation (e.g., reporting of results) are used tooptimize the objective function.

In some implementations, at each iteration, the Gradientless Descenttechnique can include a choice between a random sampling technique or aball sampling technique. In the random sampling technique, a randomsample is determined from a feasible set of values for the one or moreadjustable parameters. In the ball sampling technique, a ball is formedaround a best observed set of values and a random sample can bedetermined from within the ball. In particular, the ball can belocalized around the current best argument value and can define aboundary of a subset of the feasible set from which sampling isperformed in the ball sampling approach.

In some implementations, the choice between the random sampling technique or the ball sampling technique can be performed with or otherwise guided by a predefined probability. For example, the random sampling technique can be selected with some probability while the ball sampling technique is selected with the complementary probability (i.e., one minus that probability). In some implementations, the probability is user-defined. In some implementations, the probability is fixed while in other implementations the probability changes as iterations of the technique are performed (e.g., increasingly weighted towards the ball sampling technique over time). In some implementations, the probability can be adaptive or otherwise responsive to outcomes (e.g., trial results).

According to another aspect, a radius of the ball can be determined ateach iteration in which the ball sampling technique is performed. Inparticular, in some implementations, the radius of the ball can beselected (e.g., randomly sampled) from a novel distribution of availableradii. For example, in some implementations, the distribution of radiican be a geometric series or other power-law step-size distribution. Insome implementations, the distribution of available radii can be basedon a user-defined resolution term. For example, in some implementations,the distribution of available radii has a minimum equal to theuser-defined resolution term. In some implementations, the distributionof available radii has a maximum that is based at least in part on adiameter of a feasible set of values for the one or more adjustableparameters.

According to another aspect, the selection from the ball (e.g., therandom sample from the ball) can be projected onto the feasible set ofvalues for the one or more adjustable parameters. Thus, if the selectionfrom the ball (e.g., the random sample from the ball) is not includedwithin the available space of parameter values, then it can be projectedback into the space to provide a suggested set of values for evaluation.Thus, for the ball sampling technique, the projection of the selectionfrom within the ball onto the feasible parameter space can be output asthe suggestion to be evaluated.

Put in other terms, the Gradientless Descent technique for black boxoptimization of parameters of a system, product, or process can includeperforming one or more iterations of a sequence of operations and, aftercompletion of a final iteration of the sequence, outputting values ofthe parameters defined by a current best argument value for use inconfiguration of the system, formulation of the product or execution ofthe process.

The sequence of operations can include: a) determining whether to samplean argument value from a feasible set of argument values using a firstapproach (also referred to as random sampling) or a second approach(also referred to as ball sampling), where each argument value of thefeasible set defines values for each of plural parameters of a system,product, or process; b) based on the determination, sampling theargument value using the first (random sampling) approach or the second(ball sampling) approach, wherein the first approach includes samplingthe argument value at random from the entire feasible set and the secondapproach includes sampling the argument value from a subset of thefeasible set that is defined based on a ball around a current bestargument value; c) determining whether a performance measure of thesystem, product, or process that has been determined using parametersdefined by the sampled argument value is closer-to-optimal than acurrent closest-to-optimal performance measure; and d) if theperformance measure of the system is closer-to-optimal than the currentclosest-to-optimal performance measure, updating the current bestargument value based on the sampled argument value.

In some implementations, the ball may be defined by a radius that isselected from a geometric series of possible radii. In some of suchimplementations, the radius of the ball may be selected at random fromthe geometric series of radii. In addition or alternatively, an upperlimit on the geometric series of radii may be dependent on the diameterof the dataset, a resolution of the dataset, and/or the dimensionalityof the objective function.

The determination between the sampling of the argument value using thefirst approach and the sampling of the argument value using the secondapproach may be performed probabilistically (or may have an associatedprobability mass function). In addition or alternatively, sampling theargument value using the second approach can include determining anargument value from a space bounded by a ball around a current bestargument value, and projecting the determined argument value onto thefeasible set of argument values, thereby to obtain the sampled argumentvalue.
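The sequence of operations a) through d), together with the radius selection and projection steps, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the feasible set is assumed to be an axis-aligned box (so that projection reduces to clipping), the geometric series of radii is assumed to double from the resolution up to the diameter of the box, the sample is assumed to be uniform within the ball, and the objective is minimized. All function and parameter names are hypothetical.

```python
import math
import random


def project_onto_box(x, lower, upper):
    """Project a point onto the box [lower, upper] by clipping each coordinate."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]


def sample_in_ball(center, radius, rng):
    """Draw a point uniformly at random from the Euclidean ball around center."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    scale = radius * rng.random() ** (1.0 / len(center))
    return [c + scale * d / norm for c, d in zip(center, direction)]


def gradientless_descent(f, lower, upper, iterations=2000, p_random=0.1,
                         resolution=1e-3, seed=0):
    """Minimize f over a box and return (best_x, best_value)."""
    rng = random.Random(seed)
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lower, upper)))
    # Geometric series of radii: resolution, 2*resolution, ..., capped at the diameter.
    radii = []
    r = resolution
    while r < diameter:
        radii.append(r)
        r *= 2.0
    radii.append(diameter)

    best_x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
    best_y = f(best_x)
    for _ in range(iterations):
        # a) choose the sampling approach probabilistically
        if rng.random() < p_random:
            # b, first approach) random sample from the entire feasible set
            x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        else:
            # b, second approach) sample from a ball around the current best,
            # then project the sample back onto the feasible set
            radius = rng.choice(radii)
            x = project_onto_box(sample_in_ball(best_x, radius, rng), lower, upper)
        # c) evaluate, and d) keep the sample if it is closer to optimal
        y = f(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y
```

In a service setting, each loop iteration would correspond to one suggestion round, with the call to `f` replaced by an external evaluation of the suggested trial.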

Thus, the present disclosure provides a computer system that canimplement one or more black-box optimization techniques to iterativelysuggest new parameter values to evaluate in order to optimize theperformance of a system. Many advanced features and particularapplications have been introduced and will be described further below.In addition, the present disclosure provides a novel optimizationtechnique and includes mathematical and practical evaluation of thenovel technique.

With reference now to the Figures, example embodiments of the presentdisclosure will be discussed in further detail.

1. EXAMPLE TERMS

Throughout the present disclosure, the following example terms are usedto describe the semantics of the system:

A Trial is a list of parameter values, x, that will lead to a singleevaluation of ƒ(x). A trial can be “Completed”, which means that it hasbeen evaluated and the objective value ƒ(x) has been assigned to it,otherwise it is “Pending”. Thus, a trial can correspond to an evaluationthat provides an associated measure of performance of a system given aparticular set of parameter values.

In some instances, a Trial can also be referred to as an experiment inthe sense that the evaluation of a list of parameter values, x, can beviewed as a single experiment regarding the performance of the system.This usage should not be confused however, with application of thesystems and methods described herein to optimize the adjustableparameters of an experiment such as a scientific experiment.

A Study represents a single optimization run over a feasible space. EachStudy contains a configuration describing the feasible space, as well asa set of Trials. It is assumed that ƒ(x) does not change in the courseof a Study.

A Worker can refer to a process responsible for evaluating a PendingTrial and calculating its objective value. Such processes can beperformed by “worker computing device(s)”.

These example terms are used for simplicity and for the purposes ofillustrating example aspects of the present disclosure. Other termscould be used instead and the present disclosure is limited to neitherthe particular example terms nor their explanations provided above.

2. OVERVIEW OF EXAMPLE SYSTEM

Implementing black-box optimization as a service can involve severaldesign considerations, examples of which will be provided below.

2.1 Example Design Objectives and Constraints

The parameter optimization system of the present disclosure satisfiesthe following desiderata:

-   Ease of use: Minimal user configuration and setup;
-   Hosts state-of-the-art black-box optimization algorithms;
-   High availability;
-   Scalable to millions of trials per study, thousands of parallel trial evaluations per study, and billions of studies;
-   Easy to experiment with new algorithms; and
-   Easy to change out algorithms deployed in production.

In some implementations of the present disclosure, the parameteroptimization system of the present disclosure can be implemented as amanaged service that stores the state of each optimization. Thisapproach drastically reduces the effort a new user needs to get up andrunning; and a managed service with a well-documented and stable RPC APIallows the service to be upgraded without user effort. A defaultconfiguration option can be provided for the managed service that isgood enough to ensure that most users need never concern themselves withthe underlying optimization algorithms.

The use of a default option can allow the service to dynamically selecta recommended black-box algorithm along with low-level settings based onthe study configuration. The algorithms can be made stateless, so thatthe system can seamlessly switch between algorithms during a study, ifand when the system determines that a different algorithm is likely toperform better for a particular study. For example, Gaussian ProcessBandits provide excellent result quality (see, e.g., Snoek et al. 2012.Practical Bayesian optimization of machine learning algorithms. InAdvances in neural information processing systems. 2951-2959; andSrinivas et al. 2010. Gaussian Process Optimization in the BanditSetting: No Regret and Experimental Design. ICML (2010)), but naiveimplementations scale as O(n³) with the number of training points. Thus,once a large number of completed Trials have been collected, the systemcan switch (e.g., automatically or in response to a user input) to usinga more scalable algorithm.

At the same time, it is desirable to allow the freedom to experimentwith new algorithms or special-case modifications of the supportedalgorithms in a manner that is safe, easy, and fast. Towards these ends,the present disclosure can be built as a modular system consisting offour cooperating processes (see, e.g., FIG. 1 which is described infurther detail below) that update the state of Studies in the centraldatabase. The processes themselves are modular with several cleanabstraction layers that allow experimenting with and applying differentalgorithms easily.

Finally, it is desirable to allow multiple trials to be evaluated in parallel, and allow for the possibility that each trial evaluation could itself be a distributed process. To this end, Workers can be defined, which can be responsible for evaluating suggestions, and can be identified by a persistent name (a worker_handle) that persists across process preemptions or crashes.

2.2 Example Basic User Workflow

A developer may use one of the client libraries of the parameter optimization system of the present disclosure implemented in multiple programming languages (e.g., C++, Python, Golang, etc.), which can generate service requests encoded as protocol buffers (see, e.g., Google. 2017 b. Protocol Buffers: Google's data interchange format. https://github.com/google/protobuf. (2017). [Online]). The basic workflow is straightforward. Users can specify a study configuration indicating:

-   Identifying characteristics of the study (e.g., name, owner, permissions);
-   The set of parameters along with feasible sets for each (cf. Section 2.3.1 for details); and
-   Whether the goal is minimization or maximization of the objective function.

Given this configuration, in some implementations, basic use of thesystem (with each trial being evaluated by a single process) can beimplemented as follows:

# Register this client with the Study, creating it if necessary.
client.LoadStudy(study_config, worker_handle)
while (not client.StudyIsDone()):
    # Obtain a trial to evaluate.
    trial = client.GetSuggestion()
    # Evaluate the trial parameters.
    metrics = RunTrial(trial)
    # Report back the results.
    client.CompleteTrial(trial, metrics)

In some instances, a “client” can refer to or include a communications path to the parameter optimization system and a “worker” can refer to a process that evaluates a Trial. In some instances, each worker has or is a client. Thus, the comment “# Register this client with the Study, creating it if necessary” would remain true if “client” were replaced by “worker.” In addition, in some implementations, a copy of the “while” loop from the above example pseudocode is typically running on each worker, of which there could be any number (e.g., 1000 workers).

Further, as used in the above example pseudocode, RunTrial is theproblem-specific evaluation of the objective function ƒ. Multiple namedmetrics may be reported back to the parameter optimization system of thepresent disclosure, however one metric (or some defined combination ofthe metrics) should be distinguished as the objective value ƒ(x) fortrial x. Note that multiple processes working on a study could share thesame worker_handle if they are collaboratively evaluating the sametrial. Processes registered with a given study with the sameworker_handle can receive the same trial upon request, which enablesdistributed trial evaluation.
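The worker_handle behavior described above can be illustrated with a small in-memory stand-in for the service. This is a hedged sketch only: the class `InMemoryStudy`, its dictionary-based trial representation, and the choice to pass `worker_handle` explicitly to each call are assumptions made for illustration and are not part of the disclosed API.

```python
import itertools


class InMemoryStudy:
    """Hypothetical stand-in for a Study that hands out one Pending Trial per handle."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._pending = {}   # worker_handle -> pending trial
        self.completed = []

    def GetSuggestion(self, worker_handle):
        # Processes sharing a handle receive the same Pending Trial, which is
        # what allows them to evaluate it collaboratively.
        if worker_handle not in self._pending:
            self._pending[worker_handle] = {"id": next(self._next_id),
                                            "status": "PENDING"}
        return self._pending[worker_handle]

    def CompleteTrial(self, worker_handle, metrics):
        # Completing the trial frees the handle to receive a new suggestion.
        trial = self._pending.pop(worker_handle)
        trial.update(status="COMPLETED", metrics=metrics)
        self.completed.append(trial)
        return trial
```

Because the pending trial is keyed by the handle rather than by the process, a worker that is preempted and restarted under the same handle would also receive the trial it was previously evaluating.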

2.3 Example Interfaces

2.3.1 Configuring a Study

To configure a study, the user can provide a study name, owner, optional access permissions, and an optimization goal (either MAXIMIZE or MINIMIZE), and can specify the feasible region X via a set of ParameterConfigs, each of which specifies a parameter name along with its feasible values. For instance, the following parameter types can be supported:

-   DOUBLE: The feasible region can be a closed interval [a, b] for some real values a≤b.
-   INTEGER: The feasible region can have the form [a, b]∩ℤ for some integers a≤b.
-   DISCRETE: The feasible region can be an explicitly specified set of real numbers. In some implementations, the set of real numbers can be “ordered” in the sense that they are treated differently (e.g., by the optimization algorithms) than categorical features. For example, an optimization algorithm might be able to leverage the fact that 0.2 is between 0.1 and 0.3 in a fashion that is generally not applicable to unordered categories. However, there is no requirement that the set of real numbers be supplied in any particular order or assigned any particular ordering.
-   CATEGORICAL: The feasible region can be an explicitly specified, unordered set of strings.
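As a hedged illustration of how the four parameter types could be represented and checked, the following sketch models a study configuration in Python. The class and field names (`ParameterConfig`, `StudyConfig`, `min_value`, `feasible_values`, and so on) are assumptions chosen for readability and are not the actual message definitions used by the system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union


class ParameterType(Enum):
    DOUBLE = "DOUBLE"
    INTEGER = "INTEGER"
    DISCRETE = "DISCRETE"
    CATEGORICAL = "CATEGORICAL"


@dataclass
class ParameterConfig:
    name: str
    type: ParameterType
    min_value: Optional[float] = None   # used by DOUBLE and INTEGER
    max_value: Optional[float] = None   # used by DOUBLE and INTEGER
    feasible_values: List[Union[float, str]] = field(default_factory=list)

    def contains(self, value) -> bool:
        """Return True if value lies in this parameter's feasible set."""
        if self.type is ParameterType.DOUBLE:
            return (isinstance(value, (int, float))
                    and self.min_value <= value <= self.max_value)
        if self.type is ParameterType.INTEGER:
            return (isinstance(value, int)
                    and self.min_value <= value <= self.max_value)
        # DISCRETE and CATEGORICAL are explicitly enumerated sets.
        return value in self.feasible_values


@dataclass
class StudyConfig:
    name: str
    owner: str
    goal: str  # "MAXIMIZE" or "MINIMIZE"
    parameters: List[ParameterConfig]

    def is_feasible(self, trial: dict) -> bool:
        """Check that a trial assigns a feasible value to every parameter."""
        return (set(trial) == {p.name for p in self.parameters}
                and all(p.contains(trial[p.name]) for p in self.parameters))
```

A configuration built this way might declare, for example, a DOUBLE learning rate, an INTEGER layer count, a DISCRETE batch size, and a CATEGORICAL optimizer name, and then use `is_feasible` to validate a candidate trial before evaluation.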

Users may also suggest recommended scaling, e.g., logarithmic scalingfor parameters for which the objective may depend only on the order ofmagnitude of a parameter value.

2.3.2 Example API Definition

Workers and end users can make calls to the parameter optimization system of the present disclosure using, for example, a REST API or using an internal RPC protocol (see, e.g., Google. 2017 b. Protocol Buffers: Google's data interchange format. https://github.com/google/protobuf. (2017). [Online]).

For instance, some example system calls are:

-   CreateStudy: Given a Study configuration, this can create an optimization Study and return a globally unique identifier (“guid”) which can be used for all future system calls. If a Study with a matching name exists, the guid for that Study is returned. This can allow parallel workers to call this method and all register with the same Study.
-   SuggestTrials: This method can take a “worker handle” as input, and return a globally unique handle for a “long-running operation” that can represent the work of generating Trial suggestions. The user can then poll the API periodically to check the status of the operation. Once the operation is completed, it can contain the suggested Trials. This design can ensure that all system calls are made with low latency, while allowing for the fact that the generation of Trials can take longer.
-   AddMeasurementToTrial: This method can allow clients to provide intermediate metrics during the evaluation of a Trial. These metrics can then be used by the Automated Stopping rules to determine which Trials should be stopped early.
-   CompleteTrial: This method can change a Trial's status to “Completed”, and can provide a final objective value that can then be used to inform the suggestions provided by future calls to SuggestTrials.
-   ShouldTrialStop: This method can return a globally unique handle for a long-running operation that can represent the work of determining whether a Pending Trial should be stopped.
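The long-running operation pattern used by SuggestTrials can be sketched from the client side as follows. This is an illustrative sketch under stated assumptions: `InMemoryService` is a hypothetical stand-in that completes each operation after a fixed number of polls, and `GetOperation`, the dictionary response shape, and the polling interval are assumed names and values that do not come from the disclosure.

```python
import itertools
import time


class InMemoryService:
    """Hypothetical stand-in for the service; each operation finishes after a few polls."""

    def __init__(self, polls_until_done=3):
        self._ids = itertools.count(1)
        self._operations = {}
        self._polls_until_done = polls_until_done

    def SuggestTrials(self, worker_handle):
        op_id = "operations/%d" % next(self._ids)
        self._operations[op_id] = {"polls": 0, "worker": worker_handle}
        return op_id  # returned immediately, before any Trial has been generated

    def GetOperation(self, op_id):
        op = self._operations[op_id]
        op["polls"] += 1
        if op["polls"] < self._polls_until_done:
            return {"done": False}
        return {"done": True, "trials": [{"worker": op["worker"], "x": 0.5}]}


def wait_for_suggestions(service, worker_handle, poll_interval=0.01, timeout=5.0):
    """Start a suggestion operation and poll until it contains the suggested Trials."""
    op_id = service.SuggestTrials(worker_handle)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = service.GetOperation(op_id)
        if status["done"]:
            return status["trials"]
        time.sleep(poll_interval)
    raise TimeoutError("operation %s did not complete" % op_id)
```

The initial call returns only a handle, so it stays low-latency regardless of how long suggestion generation takes; the cost of waiting is moved into the polling loop on the client.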

2.4 Example Infrastructure

FIG. 1 depicts an example computing system architecture that can be used by the parameter optimization system of the present disclosure. In particular, the main components include (1) a Dangling Work Finder that restarts work lost to preemptions; (2) a Persistent Database that holds the current state of all Studies; (3) a Suggestion Service that creates new Trials; (4) an Early Stopping Service that helps terminate a Trial early; (5) a System API that can perform, for example, JSON, validation, multiplexing, etc.; and (6) Evaluation Workers. In some implementations, the Evaluation Workers can be provided and/or owned by the user.

2.4.1 Example Parallel Processing of Suggestion Work

In some implementations, the parameter optimization system of thepresent disclosure can be used to generate suggestions for a largenumber of Studies concurrently. As such, a single machine can beinsufficient for handling all the workload of the system. The SuggestionService can therefore be partitioned across several datacenters, with anumber of machines being used in each one. Each instance of theSuggestion Service potentially can generate suggestions for severalStudies in parallel, giving us a massively scalable suggestioninfrastructure. A load balancing infrastructure can then be used toallow clients to make calls to a unified endpoint, without needing toknow which instance is doing the work.

When a request is received by a Suggestion Service instance to generatesuggestions, the instance can first place a distributed lock on theStudy, which can ensure that work on the Study is not duplicated bymultiple instances. This lock can be acquired for a fixed period oftime, and can periodically be extended by a separate thread running onthe instance. In other words, the lock can be held until either theinstance fails, or it decides it's done working on the Study. If theinstance fails (due to e.g. hardware failure, job preemption, etc.), thelock can expire, making it eligible to be picked up by a separateprocess (called the “DanglingWorkFinder”) which can then reassign theStudy to a different Suggestion Service instance.
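The lease-style locking described above can be sketched with an in-memory table. This is a simplified illustration, not the system's implementation: a real deployment would use a distributed lock service, and the class name `StudyLeaseTable`, its methods, and the injectable clock are hypothetical.

```python
import time


class StudyLeaseTable:
    """In-memory stand-in for a distributed lock with a fixed-duration lease."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self._lease_seconds = lease_seconds
        self._clock = clock
        self._leases = {}  # study_id -> (owner, expiry_time)

    def try_acquire(self, study_id, owner):
        """Acquire the lock unless another owner holds an unexpired lease."""
        now = self._clock()
        holder = self._leases.get(study_id)
        if holder is not None and holder[1] > now and holder[0] != owner:
            return False
        self._leases[study_id] = (owner, now + self._lease_seconds)
        return True

    def extend(self, study_id, owner):
        """Renew the lease; intended to be called periodically by a separate thread."""
        now = self._clock()
        holder = self._leases.get(study_id)
        if holder is None or holder[0] != owner or holder[1] <= now:
            return False
        self._leases[study_id] = (owner, now + self._lease_seconds)
        return True

    def release(self, study_id, owner):
        """Drop the lease when the instance is done working on the Study."""
        if self._leases.get(study_id, (None, 0.0))[0] == owner:
            del self._leases[study_id]

    def find_dangling(self):
        """Return Studies whose lease expired without being released."""
        now = self._clock()
        return [s for s, (_, expiry) in self._leases.items() if expiry <= now]
```

In this sketch, an instance that crashes simply stops calling `extend`, so its lease lapses and the Study appears in `find_dangling`, where a process playing the role of the DanglingWorkFinder could reassign it.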

One consideration in maintaining a production system is that bugs areinevitably introduced as code matures. There are times when a newalgorithmic change, however well tested, can lead to instances of theSuggestion Service failing for particular Studies. If a Study is pickedup by the DanglingWorkFinder too many times, it can detect this,temporarily halt the Study, and alert an operator to the crashes. Thiscan help prevent subtle bugs that only affect a few Studies from causingcrash loops that can affect the overall stability of the system.

2.5 Example Algorithm Playground

In some implementations, the parameter optimization system of thepresent disclosure can include an algorithm playground which can providea mechanism for advanced users to easily, quickly, and safely replacethe core optimization algorithms internally supported by the parameteroptimization system with arbitrary algorithms.

The playground can serve a dual purpose; it can allow rapid prototypingof new algorithms, and it can allow power-users to easily customize theparameter optimization system of the present disclosure with advanced orexotic capabilities that can be particular to a use-case. Thus, users ofthe playground can benefit from all of the infrastructure of theparameter optimization system aside from the core algorithms, such asaccess to a persistent database of Trials, the dashboard, and/orvisualizations.

One central aspect of the playground is the ability to inject Trialsinto a Study. The parameter optimization system of the presentdisclosure can allow the user or other authorized processes to requestone or more particular Trials to be evaluated. In some embodiments, theparameter optimization system of the present disclosure may not suggestany Trials for evaluation, but can rely on an external binary togenerate Trials for evaluation, which can then be pushed to the systemfor later distribution to the workers.

In one example embodiment, the architecture of the Playground caninvolve the following key components: System API, Custom Policy,Abstract Policy, Playground Binary, and Evaluation Workers.

In particular, FIG. 2 depicts a block diagram of an example computingsystem architecture that can be used to implement the AlgorithmPlayground. The main components include: (1) a System API that takesservice requests; (2) a Custom Policy that implements the AbstractPolicy and generates suggested Trials; (3) a Playground Binary thatdrives the Custom Policy based on demand reported by the System API; and(4) the Evaluation Workers that behave as normal, such as, requestingand evaluating Trials.

The Abstract Policy can include two abstract methods:

1. GetNewSuggestions(trials, num_suggestions); and
2. GetEarlyStoppingTrials(trials).

The two abstract methods can be implemented by the user's custom policy.Both these methods can be stateless and at each invocation take the fullstate of all trials in the database, though stateful implementations arewithin the scope of the present disclosure. GetNewSuggestions cangenerate, for example, num_suggestions number of new trials, while theGetEarlyStoppingTrials method can return a list of Pending Trials thatshould be stopped early. The custom policy can be registered with thePlayground Binary which can communicate with the System API using afixed polling schedule. The Evaluation Workers can maintain the serviceabstraction and can be unaware of the existence of the Playground.
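A custom policy that implements the two abstract methods might look like the following sketch. It is offered only as an illustration: the dictionary-based trial representation, the `RandomSearchPolicy` class, and its stopping rule (stop a pending trial whose latest intermediate value is below half of the best completed objective, assuming maximization) are all assumptions and are not prescribed by the disclosure.

```python
import abc
import random


class AbstractPolicy(abc.ABC):
    """Interface that a user's custom policy implements."""

    @abc.abstractmethod
    def GetNewSuggestions(self, trials, num_suggestions):
        """Return num_suggestions new trials given the full list of trials."""

    @abc.abstractmethod
    def GetEarlyStoppingTrials(self, trials):
        """Return the Pending trials that should be stopped early."""


class RandomSearchPolicy(AbstractPolicy):
    """Hypothetical custom policy; it depends only on the trials passed in."""

    def __init__(self, bounds, seed=0):
        self._bounds = bounds  # {parameter_name: (low, high)}
        self._rng = random.Random(seed)

    def GetNewSuggestions(self, trials, num_suggestions):
        return [
            {"status": "PENDING",
             "parameters": {name: self._rng.uniform(lo, hi)
                            for name, (lo, hi) in self._bounds.items()}}
            for _ in range(num_suggestions)
        ]

    def GetEarlyStoppingTrials(self, trials):
        completed = [t["objective"] for t in trials if t["status"] == "COMPLETED"]
        if not completed:
            return []
        best = max(completed)
        # Assumed rule: stop pending trials far below the best completed result.
        return [t for t in trials
                if t["status"] == "PENDING"
                and t.get("intermediate") is not None
                and t["intermediate"] < 0.5 * best]
```

A driver playing the role of the Playground Binary would call these methods with the current contents of the trial database whenever the System API reports demand.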

2.6 Example Benchmarking Suite

The parameter optimization system of the present disclosure can include an integrated framework that enables efficient benchmarking of the supported algorithms on a variety of objective functions. Many of the objective functions come from the Black-Box Optimization Benchmarking Workshop (see Finck et al. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009). [Online]), but the framework allows for any function to be modeled by implementing an abstract Experimenter class, which can have a virtual method responsible for calculating the objective value for a given Trial, and a second virtual method that can return the optimal solution for that benchmark.

Users can configure a set of benchmark runs by providing a set ofalgorithm configurations and a set of objective functions. Thebenchmarking suite can optimize each function with each algorithm ktimes (where k is configurable), producing a series ofperformance-over-time metrics which can then be formatted afterexecution. The individual runs can be distributed over multiple threadsand multiple machines, so it is easy to have thousands or more ofbenchmark runs being executed in parallel.
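The Experimenter abstraction and the repeated benchmark runs can be sketched as follows. This is a minimal sketch under stated assumptions: the objective is minimized, a trial is a plain list of floats, the runs execute sequentially rather than across threads or machines, and the names `Sphere`, `random_search`, and `run_benchmarks`, as well as the search range of [-5, 5], are hypothetical.

```python
import abc
import random


class Experimenter(abc.ABC):
    """Abstract benchmark objective (minimization is assumed in this sketch)."""

    @abc.abstractmethod
    def evaluate(self, trial):
        """Return the objective value for a trial (a list of floats here)."""

    @abc.abstractmethod
    def optimal_solution(self):
        """Return a parameter list that attains the optimum."""


class Sphere(Experimenter):
    def __init__(self, dim):
        self.dim = dim

    def evaluate(self, trial):
        return sum(v * v for v in trial)

    def optimal_solution(self):
        return [0.0] * self.dim


def random_search(experimenter, budget, rng, low=-5.0, high=5.0):
    """Return the optimality gap of the best trial seen after each evaluation."""
    optimum = experimenter.evaluate(experimenter.optimal_solution())
    best, curve = float("inf"), []
    for _ in range(budget):
        trial = [rng.uniform(low, high) for _ in range(experimenter.dim)]
        best = min(best, experimenter.evaluate(trial))
        curve.append(best - optimum)
    return curve


def run_benchmarks(algorithms, experimenters, k, budget, seed=0):
    """Run every algorithm on every objective k times; return the per-run curves."""
    seeds = random.Random(seed)
    results = {}
    for algorithm_name, algorithm in algorithms.items():
        for function_name, experimenter in experimenters.items():
            results[(algorithm_name, function_name)] = [
                algorithm(experimenter, budget, random.Random(seeds.getrandbits(32)))
                for _ in range(k)
            ]
    return results
```

Because each run receives its own seeded generator and shares no state with the others, the inner loop could be handed to a thread or process pool without changing the results.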

2.7 Example Dashboard and Visualizations

The parameter optimization system of the present disclosure can include a web dashboard which can be used for monitoring and/or changing the state of Studies. The dashboard can be fully featured and can implement the full functionality of the parameter optimization system API. The dashboard can also be used for: (1) Tracking the progress of a study. (2) Interactive visualizations. (3) Creating, updating and deleting a study. (4) Requesting new suggestions, early stopping, activating/deactivating a study. See FIG. 3 for a section of the dashboard. In addition to monitoring and visualizations, the dashboard can contain action buttons such as Get Suggestions.

In particular, FIG. 3 depicts a section of the dashboard for tracking the progress of Trials and the corresponding objective function values. As illustrated, the dashboard also includes action buttons such as “Get Suggestions” for manually requesting suggestions.

In some implementations, the dashboard can include a translation layer which can convert between JSON and protocol buffers when talking with backend servers (see, e.g., Google. 2017b. Protocol Buffers: Google's data interchange format. https://github.com/google/protobuf. (2017). [Online]). In some implementations, the dashboard can be built with an open source web framework such as Polymer using web components and can use material design principles (see, e.g., Google. 2017a. Polymer: Build modern apps. https://github.com/Polymer/polymer. (2017). [Online]). In some implementations, the dashboard can contain interactive visualizations for analyzing the parameters of a study. For instance, a visualization can be used which is easily scalable to high dimensional spaces (e.g., 15 dimensions or more) and works with both numerical and categorical parameters.

One example of such a visualization is the parallel coordinates visualization. See, e.g., Heinrich and Weiskopf. 2013. State of the Art of Parallel Coordinates. In Eurographics (STARs). 95-116.

See FIG. 4 for an example parallel coordinates visualization provided by the parameter optimization system. In one embodiment, each vertical axis can be a dimension corresponding to a parameter, whereas each horizontal line can be an individual trial. The point at which the horizontal line intersects the vertical axis can indicate the value of the parameter in that dimension. This can be used for examining how the dimensions co-vary with each other and also against the objective function value. In some implementations, the visualizations can be built using d3.js (see, e.g., Bostock et al. 2011. D³ data-driven documents. IEEE transactions on visualization and computer graphics 17, 12 (2011), 2301-2309).

In particular, FIG. 4 depicts an example parallel coordinates visualization that can be used for examining results from different runs. The parallel coordinates visualization scales to high dimensional spaces (e.g., ~15 dimensions) and works with both numerical and categorical parameters. Additionally, it can be interactive and can allow various modes of separating, combining, and/or comparing data.

3. EXAMPLE PARAMETER OPTIMIZATION ALGORITHMS

The parameter optimization system of the present disclosure can be implemented using a modular design which can allow the user to easily support multiple algorithms. In some implementations, for studies with under a thousand trials, the parameter optimization system of the present disclosure can default to using Batched Gaussian Process Bandits (see Desautels et al. 2014. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. Journal of Machine Learning Research 15, 1 (2014), 3873-3923). For example, in some implementations, a Matérn kernel with automatic relevance determination (see, e.g., section 5.1 of Rasmussen and Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press. for a discussion) and the expected improvement acquisition function (see Močkus et al. 1978. The Application of Bayesian Methods for Seeking the Extremum. Vol. 2. Elsevier. pages 117-128) can be used. In one embodiment, local maxima of the acquisition function can be found with a proprietary gradient-free hill climbing algorithm with random starting points.

In one embodiment, discrete parameters can be incorporated by embedding them in $\mathbb{R}$. Categorical parameters with k feasible values can be represented via one-hot encoding, i.e., embedded in $[0,1]^k$. The Gaussian Process regressor can provide a continuous and differentiable function upon which we can walk uphill; when the walk has converged, the result can be rounded to the nearest feasible point.
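The embedding and the final rounding step can be illustrated with a short Python sketch. The spec format (a list of (name, kind, feasible_values) tuples) and the function names are assumptions made for this example only. For a one-hot block, the nearest feasible point in Euclidean distance is the vertex at the largest coordinate, so rounding reduces to an argmax.

```python
def embed(params, spec):
    """Map a parameter dict to a flat vector of floats.

    spec is a list of (name, kind, feasible_values) tuples, where kind is
    "discrete" (numeric values) or "categorical" (arbitrary labels).
    """
    vector = []
    for name, kind, feasible in spec:
        if kind == "discrete":
            vector.append(float(params[name]))
        else:  # categorical: one-hot block of length k
            vector.extend(1.0 if params[name] == v else 0.0 for v in feasible)
    return vector


def round_to_feasible(vector, spec):
    """Map a continuous point back to the nearest feasible parameter values."""
    params, pos = {}, 0
    for name, kind, feasible in spec:
        if kind == "discrete":
            # Nearest feasible numeric value.
            params[name] = min(feasible, key=lambda v: abs(v - vector[pos]))
            pos += 1
        else:
            # Nearest one-hot vertex is the one at the largest coordinate.
            block = vector[pos:pos + len(feasible)]
            params[name] = feasible[block.index(max(block))]
            pos += len(feasible)
    return params
```

With a spec of one discrete parameter over {1, 2, 4, 8} and one categorical parameter with three values, a converged continuous point such as (2.6, 0.2, 0.7, 0.4) is rounded to the discrete value 2 and the second category.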

In some implementations, Bayesian deep learning models can be used in lieu of Gaussian processes for scalability.

For studies with tens of thousands of trials or more, other algorithms may be used. RandomSearch and GridSearch are supported as first-class choices and may be used in this regime, and many other published algorithms are supported through the algorithm playground. In addition, the example “Gradientless Descent” algorithm described herein and/or variations thereof can be used under these or other conditions instead of the more typical algorithms such as RandomSearch or GridSearch.

For all of these algorithms, data normalization can be supported, which can map numeric parameter values into [0,1] and labels onto [−0.5,0.5]. Depending on the problem, a one-to-one nonlinear mapping may be used for some of the parameters, and is typically used on the labels. Data normalization can be handled before trials are presented to the trial suggestion algorithms, and the algorithms' suggestions can be transparently mapped back to the user-specified scaling.
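One possible form of this normalization is sketched below. The function names are hypothetical, the logarithmic mapping stands in for whichever one-to-one nonlinear mapping is chosen for a parameter, and the labels are scaled linearly by their observed range purely for simplicity; the disclosure does not specify these details.

```python
import math


def scale_param(value, lo, hi, log_scale=False):
    """Map a numeric parameter value from [lo, hi] into [0, 1]."""
    if log_scale:  # example of a one-to-one nonlinear mapping
        value, lo, hi = math.log(value), math.log(lo), math.log(hi)
    return (value - lo) / (hi - lo)


def unscale_param(u, lo, hi, log_scale=False):
    """Inverse of scale_param: map u in [0, 1] back to the user's scaling."""
    if log_scale:
        return math.exp(math.log(lo) + u * (math.log(hi) - math.log(lo)))
    return lo + u * (hi - lo)


def scale_labels(labels):
    """Map observed labels onto [-0.5, 0.5] (linear min-max, an assumption)."""
    lo, hi = min(labels), max(labels)
    if hi == lo:
        return [0.0 for _ in labels]
    return [(y - lo) / (hi - lo) - 0.5 for y in labels]
```

A suggestion algorithm would then operate entirely in the normalized space, with unscale_param applied to each suggested value before it is returned to the user.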

3.1 Example Automated Early Stopping

In some important applications of black-box optimization, information related to the performance of a trial may become available during trial evaluation. For example, this may take the form of intermediate results. If sufficiently poor, these intermediate results can be used to terminate a trial or evaluation early, thereby saving resources.

Perhaps the best example of such a performance curve occurs when tuning machine learning hyperparameters for models trained progressively (e.g., via some version of stochastic gradient descent). In this case, the model (e.g., as represented by a sequence of trained models) typically becomes more accurate as it trains on more data, and the accuracy of the model is available at the end of each training epoch. Using these accuracy vs. training step curves, it is often possible to determine that a trial's parameter settings are unpromising well before evaluation is finished. In this case trial evaluation can be terminated early, freeing those evaluation resources for more promising trial parameters. When done algorithmically, this is referred to as automated early stopping.

The parameter optimization system of the present disclosure can support automated early stopping via an API call to a ShouldTrialStop method. There can be an Automated Stopping Service similar to the Suggestion Service that can accept requests from the system API to analyze a study and determine the set of trials that should be stopped, according to the configured early stopping algorithm. As with suggestion algorithms, several automated early stopping algorithms can be supported, and rapid prototyping can be done via the algorithm playground.

3.2 Example Performance Curve Stopping Rule

As one example early stopping algorithm, the parameter optimization system of the present disclosure can support a new automated early stopping rule that is based on non-parametric regression (e.g., Gaussian Process regression) with a carefully designed inductive bias. This stopping algorithm can work in a stateless fashion. For example, it can be given the full state of all trials in the Study when determining which trials should stop. The parameter optimization system of the present disclosure can also optionally support any additional early stopping algorithms beyond the example performance curve rule described below.

This stopping rule can perform regression on the performance curves to make a prediction of the final objective value of a Trial given a set of Trials that are already Completed, and a partial performance curve (i.e., a set of measurements taken during Trial evaluation). Given this prediction, in some implementations, if the probability of exceeding the optimal value found thus far is sufficiently low, early stopping can be requested for the Trial.

While prior work on automated early stopping used Bayesian parametric regression (see, e.g., Domhan et al. 2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI. 3460-3468; and Swersky et al. 2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014)), according to an aspect of the present disclosure, a Bayesian non-parametric regression can also be used, such as a Gaussian process model with a carefully designed kernel that measures similarity between performance curves. Such an algorithm can be robust to many kinds of performance curves, including those coming from applications other than tuning machine learning hyperparameters in which the performance curves may have very different semantics. Notably, this stopping rule can still work well even when the performance curve is not measuring the same quantity as the objective value, but is merely predictive of it.

3.2.1 Regression on Performance Curves

Gaussian Processes provide flexible non-parametric regression, with priors specified via a kernel function k. Given input parameters in X and performance curves in $\mathcal{C}$ (which encode the objective value over time, and may formally be thought of as sets of (time, objective value) pairs), we take the label of a trial to be its final performance.

Let $\mathcal{Z} = X \times \mathcal{C}$ be the domain of trials along with their performance curves. The existing data can be regressed upon for a partially completed trial with a suitable kernel function $k: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$.

Swersky et al. take a parametric approach, developing a kernel that is tailored to exponentially decaying performance (in their words “strongly supports exponentially decaying functions”) (2014. Freeze-thaw Bayesian optimization. arXiv preprint arXiv:1406.3896 (2014)). In contrast, the example performance curve stopping rule provided by the present disclosure takes a non-parametric approach. This approach is based on observations that, in practice, performance curves are typically not exponentially decaying, and may be hard to parameterize with few parameters.

Specifically, given a distance metric on performance curves, $\delta: \mathcal{C} \times \mathcal{C} \to \mathbb{R}$, and an isotropic kernel function $k(x, x') = \kappa(\|x - x'\|)$, where $\|\cdot\|$ is a norm, the kernel can be augmented to handle the performance curves via

$k((x, c), (x', c')) = \kappa(\|x - x'\| + \delta(c, c'))$

Examples for κ include the familiar Gaussian and Matérn kernel functions. A reasonable choice for δ may also smooth out the performance curves, and may include its own kernel hyperparameters such as a length-scale.
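A minimal sketch of the augmented kernel is shown below, using a Gaussian κ. The particular curve distance δ used here (root-mean-square difference over the common prefix of the two curves, divided by a curve_scale hyperparameter) is an assumption chosen only to make the example concrete; the disclosure does not prescribe a specific δ, and no smoothing is applied in this sketch.

```python
import math


def kappa_gaussian(r, length_scale=1.0):
    """Gaussian (squared-exponential) profile as a function of distance r."""
    return math.exp(-0.5 * (r / length_scale) ** 2)


def param_distance(x, x2):
    """Euclidean norm of x - x2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x2)))


def curve_distance(c, c2, curve_scale=1.0):
    """Illustrative delta: RMS difference over the common prefix of c and c2."""
    n = min(len(c), len(c2))
    if n == 0:
        return 0.0
    rms = math.sqrt(sum((a - b) ** 2 for a, b in zip(c[:n], c2[:n])) / n)
    return rms / curve_scale


def augmented_kernel(x, c, x2, c2, length_scale=1.0, curve_scale=1.0):
    """k((x, c), (x', c')) = kappa(||x - x'|| + delta(c, c'))."""
    r = param_distance(x, x2) + curve_distance(c, c2, curve_scale)
    return kappa_gaussian(r, length_scale)
```

Comparing curves over their common prefix lets a partially completed trial be compared against completed ones, which is the situation in which the stopping rule is applied.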

3.2.2 Deciding when to Stop Early

Good estimates of the final objective value, together with confidence intervals, enable the termination of unpromising trials, e.g., those whose probability of exceeding the best trial result yet found is below a set threshold.

This method also allows for automatic convergence detection: if the confidence interval of the final objective value both contains the most recent intermediate result (possibly after smoothing) and is smaller in width than a fixed threshold, then the trial can be declared as converged and terminated.
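The two decisions described in this section can be sketched as follows, assuming the regression yields a Gaussian posterior (mean and standard deviation) for the final objective value and that larger objective values are better. The function names, the 0.05 probability threshold, and the z = 1.96 interval multiplier are illustrative assumptions, not values taken from the disclosure.

```python
import math


def prob_exceeds(mean, std, best):
    """P(final value > best) under a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        return 1.0 if mean > best else 0.0
    z = (best - mean) / std
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # equals 1 - Phi(z)


def should_stop(mean, std, best, threshold=0.05):
    """Stop when the trial is unlikely to beat the best result found so far."""
    return prob_exceeds(mean, std, best) < threshold


def has_converged(mean, std, latest, max_width, z=1.96):
    """Converged if the interval contains the latest result and is narrow."""
    lo, hi = mean - z * std, mean + z * std
    return lo <= latest <= hi and (hi - lo) < max_width
```

For instance, a trial predicted to finish at 0.5 ± 0.05 when the best completed trial reached 0.9 would be stopped, whereas a trial predicted at 0.9 ± 0.05 against a best of 0.85 would continue.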

3.2.3 Regressing on Transformed Data

Clearly, the effectiveness of this early stopping rule depends on the quality of the predictions of the final objective value for an ongoing trial with intermediate results. Encoding good priors to ensure the right kinds of inductive bias is important.

To improve the regression in practice, start with an informal observation: when training models with different hyperparameter values, the performance curves are often similar in shape, but offset from one another.

More precisely, suppose (smoothed) performance curves $c_1, c_2, \ldots, c_n$ are given from models trained with different hyperparameter values, each of which is a function $c: [T] \to \mathbb{R}$, where $[n] := \{1, 2, \ldots, n\}$ and $T \in \mathbb{N}$ is the maximum training time for any model. Also define $y_i := c_i(T)$ to be the label of the $i$-th data point.

Let $c \in \mathcal{C}$ be a performance curve. For convenience, the norm $\|c\|$ is defined to be the $\|\cdot\|$ norm of the vector $(c(1), c(2), \ldots, c(T))$. Let $\mathbf{1}$ be the constant function ($\forall t, \mathbf{1}(t) = 1$) and let $\|\cdot\|_2$ be the familiar Euclidean norm. The observation is then that, for most pairs $i, j$, $\min_{\alpha} \|c_i - c_j - \alpha \mathbf{1}\|_2$ is small, and in particular that $c_i(T) - \alpha^*$ is an informative estimate for $c_j(T)$, where $\alpha^*$ is the argmin of the previous expression.

This observation can be exploited to improve predictive performance. As a toy example to illustrate one possible method of execution, suppose we have constants $\alpha_i$ sampled from $\mathcal{N}(0, 1)$, and performance curves $c_i$ with $c_i(t)$ sampled as $c_i(t) \sim \alpha_i + \mathcal{N}(0, \sigma^2)$, with $\sigma^2 \ll 1$. Given a test data point with partial performance curve $c_*$ that looks roughly constant with mean value $\alpha_*$, the task is to predict a final objective value of $\alpha_*$.

This prediction can be accomplished by regressing over transformed data $\{\varphi(c_i)\}_{i \ge 0}$, and then inverse-transforming the regressed value. As a concrete example, consider

$\varphi(c) = \operatorname{argmin}\{\|c'\|_2 : c' = c - \alpha \mathbf{1},\ \alpha \in \mathbb{R}\}$

which is equivalent to

$\varphi(c) = c - \mathbf{1}\left(\frac{1}{T}\sum_{t=1}^{T} c(t)\right)$

Let the adjustment of c relative to φ be defined as

$\alpha(c) := \frac{1}{T}\sum_{t=1}^{T} c(t).$

Then Gaussian Process regression can be performed on the adjusted data $\{(x_i, \varphi(c_i))\}_{i \ge 0}$, where, as before, the label is the last point in the (now adjusted) performance curve. That is, the $i$-th label can be translated by $-\alpha(c_i)$. Given a posterior prediction of $y_*$ for an adjusted test point $(x_*, \varphi(c_*))$, the value $y_* + \alpha(c_*)$ is ultimately predicted.
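The transform, regress, and inverse-transform steps can be sketched in a few lines of Python. To keep the example self-contained, the Gaussian Process is replaced by a placeholder regressor (mean_regressor) that simply returns the mean adjusted label; this stand-in, the function names, and the choice to compute the adjustment of a partial curve over its observed points only are all assumptions made for illustration.

```python
def alpha(c):
    """Adjustment alpha(c): the mean of the (observed part of the) curve."""
    return sum(c) / len(c)


def phi(c):
    """Transformed curve phi(c) = c - alpha(c) * 1."""
    a = alpha(c)
    return [v - a for v in c]


def adjusted_label(c):
    """Label of a completed curve after translation by -alpha(c)."""
    return c[-1] - alpha(c)


def mean_regressor(train_curves, train_labels, test_curve):
    """Placeholder for Gaussian Process regression (not a GP)."""
    return sum(train_labels) / len(train_labels)


def predict_final(completed_curves, partial_curve, regress=mean_regressor):
    """Regress on adjusted data, then undo the adjustment for the test curve."""
    labels = [adjusted_label(c) for c in completed_curves]
    y_star = regress([phi(c) for c in completed_curves], labels,
                     phi(partial_curve))
    return y_star + alpha(partial_curve)
```

In the toy setting described above, completed curves that are nearly constant have adjusted labels close to zero, so the prediction for a roughly constant partial curve is close to its own mean value, as desired.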

3.3 Example Transfer Learning

A valuable feature for black-box optimization tasks is to avoid doing repeated work. Often, users run a study that might be similar to studies they have run before. The parameter optimization system of the present disclosure can support a form of Transfer Learning which can allow users to leverage data from prior studies to guide and accelerate their current study. For instance, one might tune the learning rate and regularization of a machine learning system, then use that Study as a prior to tune the same ML system on a different data set.

One example approach to transfer learning provided by the present disclosure is relatively simple, yet robust to changes in objective across studies. In particular, the transfer learning approach provided by the present disclosure: scales well to situations where there are many prior studies; effectively accelerates studies (e.g., achieves better results with fewer trials) when the priors are good, particularly in cases where the location of the optimum, x*, does not change much (e.g., between the prior Study and the current Study); is robust against uninformative prior studies; and shares information even when there is no formally expressible relationship between the prior and current Studies.

In previous work on transfer learning in the context of hyperparameter optimization, Bardenet et al. discuss the difficulty in transferring knowledge across different datasets, especially when the observed metrics and the sampling of the datasets are different (see Collaborative hyperparameter tuning. ICML 2 (2013), 199). They use a ranking approach for constructing a surrogate model for the response surface. This approach suffers from the computational overhead of running a ranking algorithm. Yogatama and Mann propose a more efficient approach, which scales as Θ(kn+n³) for k studies of n trials each, where the cubic term comes from using a Gaussian process in their acquisition function (see Efficient Transfer Learning Method for Automatic Hyperparameter Tuning. JMLR: W&CP 33 (2014), 1077-1085).

One natural approach to implementing transfer learning might be to build a larger Gaussian Process regressor that is trained on both the prior(s) and the current Study. However, that approach fails to satisfy the first design criterion: for k studies with n trials each, it would require Ω(k³n³) time. Such an approach also requires one to specify or learn kernel functions that bridge between the prior(s) and current Study, which may result in poorly chosen inductive biases and reduce its effectiveness.

Instead, one example approach provided by the present disclosure builds a stack of Gaussian Process regressors, where each regressor is associated with a study, and where each level is trained on the residuals relative to the regressor below it. There can be a linear ordering on all the studies that can put the current study at one extreme. The studies can be performed sequentially, in which case the ordering can be the temporal order in which the studies were performed.

Continuing with this example approach, the bottom of the stack can contain a regressor built using data from the oldest study in the stack. The regressor above it can be associated with the 2nd oldest study, and can regress on the residual labels of its data with respect to the predictions of the regressor below it. Similarly, the regressor associated with the i^(th) study can be built using the data from that study, and can regress on the residual labels with respect to the predictions of the regressor below it.

More formally, suppose a sequence of studies $S_{i=1:k}$ on unknown objective functions $f_{i=1:k}$ is available, where the current study is $S_k$, and we build two sequences of regressors $R_{i=1:k}$ and $R'_{i=1:k}$ having posterior mean functions $\mu_{i=1:k}$ and $\mu'_{i=1:k}$, respectively, and posterior standard deviation functions $\sigma_{i=1:k}$ and $\sigma'_{i=1:k}$, respectively. The final predictions will be $\mu_k$ and $\sigma_k$.

Let $D_i = \{(x_t^i, y_t^i)\}_t$ be the dataset for study $S_i$. Let $R'_i$ be a regressor trained using data $\{(x_t^i, y_t^i - \mu_{i-1}(x_t^i))\}_t$, which computes $\mu'_i$ and $\sigma'_i$. Let $\mu_1$ and $\sigma_1$ be derived from a regressor without a prior, which is trained on $D_1$ directly rather than on the more complex form that subtracts $\mu$ from $y$. Then the posterior mean at level $i$ is defined as $\mu_i(x) := \mu'_i(x) + \mu_{i-1}(x)$. The posterior standard deviation at level $i$, $\sigma_i(x)$, is taken to be a weighted geometric mean of $\sigma'_i(x)$ and $\sigma_{i-1}(x)$, where the weights are a function of the amount of data (i.e., completed trials) in $S_i$ and $S_{i-1}$. The exact weighting function depends on a constant $\alpha \approx 1$ that sets the relative importance of old and new standard deviations.

Details are provided in the pseudocode in Algorithm 1, and example regressors are illustrated in FIG. 5. In particular, FIG. 5 is an example illustration of the transfer learning scheme provided by the present disclosure, showing how $\mu'_i$ is built from the residual labels with respect to $\mu_{i-1}$ (shown in dotted lines).

Algorithm 1 Transfer Learning Regressor

  # Returns a function R(x_test), which returns (μ, σ)
  function GetRegressor(D_training, i)
    if i == 0: return TrainGP(D_0)
    # Recurse to get a regressor (μ_{i−1}(x), σ_{i−1}(x)) trained on
    # the data for all levels of the stack below this one.
    R_prior ← GetRegressor(D_training, i − 1)
    # Compute training residuals.
    D_residuals ← [(x, y − R_prior(x)[0]) for (x, y) ∈ D_i]
    # Train a Gaussian Process (μ′_i(x), σ′_i(x)) on the residuals.
    GP_residuals ← TrainGP(D_residuals)
    function MyRegressor(x_test)
      μ_prior, σ_prior ← R_prior(x_test)
      μ_top, σ_top ← GP_residuals(x_test)
      μ ← μ_top + μ_prior
      β ← α|D_i| / (α|D_i| + |D_{i−1}|)
      σ ← σ_top^β · σ_prior^(1−β)
      return μ, σ
    end function
    return MyRegressor
  end function

Algorithm 1 has the property that for a sufficiently dense sampling of the feasible region in the training data for the current study, the predictions converge to those of a regressor trained only on the current study data. This ensures a certain degree of robustness: badly chosen priors will not harm the prediction asymptotically. In Algorithm 1, the notation R_prior(x)[0] indicates evaluating R_prior at x and taking the predicted mean, i.e., the first element of the returned (μ, σ) pair.
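A runnable Python rendering of the stacking logic in Algorithm 1 is sketched below. The regressor trainer is passed in as a function so that any Gaussian Process implementation could be plugged in; the train_constant stand-in used here (which predicts the mean and population standard deviation of its labels) is not a Gaussian Process and is included only so the example runs without external libraries. The function and argument names are assumptions for this illustration.

```python
import statistics


def get_regressor(datasets, i, train, alpha=1.0):
    """Returns a function x -> (mu, sigma) for level i of the stack.

    datasets[j] is the list of (x, y) pairs for study j (0 is the oldest);
    train maps a list of (x, y) pairs to a function x -> (mu, sigma).
    """
    if i == 0:
        return train(datasets[0])
    # Regressor for all levels of the stack below this one.
    prior = get_regressor(datasets, i - 1, train, alpha)
    # Train the top-level regressor on residuals against the prior mean.
    residuals = [(x, y - prior(x)[0]) for x, y in datasets[i]]
    top = train(residuals)
    n_new, n_old = len(datasets[i]), len(datasets[i - 1])
    beta = alpha * n_new / (alpha * n_new + n_old)

    def regressor(x_test):
        mu_prior, sigma_prior = prior(x_test)
        mu_top, sigma_top = top(x_test)
        mu = mu_top + mu_prior
        # Weighted geometric mean of the two standard deviations.
        sigma = sigma_top ** beta * sigma_prior ** (1.0 - beta)
        return mu, sigma

    return regressor


def train_constant(data):
    """Stand-in trainer (NOT a GP): constant mean and spread of the labels."""
    ys = [y for _, y in data]
    mu = statistics.fmean(ys)
    sigma = max(statistics.pstdev(ys), 1e-6) if len(ys) > 1 else 1.0
    return lambda x: (mu, sigma)
```

With two studies whose labels are {1, 3} and {5, 7}, the bottom level predicts a mean of 2, the residuals of the second study are {3, 5}, and the stacked prediction is 4 + 2 = 6, illustrating how each level only has to model what the levels below it failed to explain.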

In production settings, transfer learning is often particularly valuable when the number of trials per study is relatively small, but there may be many such studies. For example, certain production machine learning systems may be very expensive to train, limiting the number of trials that can be run for hyperparameter tuning, yet are mission critical for a business and are thus worked on year after year. Over time, the total number of trials spanning several small hyperparameter tuning runs can be quite informative. The example transfer learning scheme is particularly well-suited to this case (see also section 4.3).

4. EXAMPLE RESULTS

4.1 Example Performance Evaluation

In order to evaluate the performance of the parameter optimization system of the present disclosure, benchmark functions are required: pre-selected, easily calculated functions with known optimal points that have proven challenging for black-box optimizers. The success of an optimizer on some benchmark function ƒ can then be measured by its final optimality gap. That is, if x* is any argument minimizing ƒ(·), and $\hat{x}$ is the best solution found by the optimization algorithm, then the quality of the results can be measured by $|f(\hat{x}) - f(x^*)|$. If, as is frequently the case, the optimization method has a stochastic component, the average optimality gap can be calculated by averaging over multiple runs of the optimization algorithm on the benchmark.

Comparing between benchmarks is more difficult given the variation in region size and difficulty. For example, a good black-box optimizer applied to the Rastrigin function (see, e.g., Finck et al. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009). [Online]) might come within 160 of the optimum with some effort, while random sampling of the Beale function can quickly reach a point within 60 of the optimum. Hence normalization is necessary. One example normalizing factor is the performance of Random Search, quantifying how much each algorithm improves over random sampling. Once normalized, the results can be averaged over the benchmarks to get a single value representing average improvement over random sampling.
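The evaluation procedure described above (average optimality gap per benchmark, normalization by Random Search, then averaging over benchmarks) can be sketched as follows. The function names and the dictionary-based data layout are assumptions for this example, and any numbers passed in would be the caller's own measurements.

```python
def average_gap(final_values, f_opt):
    """Average optimality gap |f(x_hat) - f(x*)| over multiple runs."""
    return sum(abs(v - f_opt) for v in final_values) / len(final_values)


def normalized_improvement(alg_results, random_results, optima):
    """Mean over benchmarks of (algorithm gap) / (Random Search gap).

    alg_results and random_results map a benchmark name to the list of best
    objective values found in each run; optima maps it to the known optimum.
    Assumes Random Search never attains a gap of exactly zero.
    """
    ratios = []
    for name, f_opt in optima.items():
        alg_gap = average_gap(alg_results[name], f_opt)
        random_gap = average_gap(random_results[name], f_opt)
        ratios.append(alg_gap / random_gap)
    return sum(ratios) / len(ratios)
```

A returned value below 1 indicates that, on average across the benchmarks, the algorithm reaches a smaller optimality gap than Random Search.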

The benchmarks selected were primarily taken from the Black-Box Optimization Benchmarking Workshop (see Finck et al. 2009. Real-Parameter Black-Box Optimization Benchmarking 2009: Presentation of the Noiseless Functions. http://coco.gforge.inria.fr/lib/exe/fetch.php?media=download3.6:bbobdocfunctions.pdf. (2009). [Online]) (an academic competition for black-box optimization tools), and include, for example, the Beale, Branin, Ellipsoidal, Rastrigin, Rosenbrock, Six Hump Camel, Sphere, and Styblinski benchmark functions. Note that all functions are formulated as minimization problems.

4.2 Example Empirical Results

In FIG. 6, result quality is compared for three example algorithms currently implemented in the framework: a spectral Gaussian process implementation (Quiñonero-Candela et al. 2010. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research 11, Jun. (2010), 1865-1881), the SMAC algorithm (Hutter et al. 2011. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization. Springer, 507-523), and a probabilistic search method provided by the present disclosure.

In particular, FIG. 6 provides a ratio of the average optimality gap for each algorithm to that of the Random Search at a given number of samples. The 2× Random Search is a Random Search allowed to sample two points at every step (as opposed to a single point for the other algorithms).

For a given dimension d, each benchmark function is generalized into a d dimensional space, each benchmark is run 100 times, and the intermediate results are recorded (averaging these over the multiple runs). FIG. 6 shows the results for dimensions 4, 8, 16, and 32 in terms of their improvement over Random Search. For each plot, the horizontal axis represents the point in the algorithm where that number of trials have been evaluated, while the vertical axis indicates the algorithm's optimality gap as a fraction of the Random Search optimality gap at the same point. The 2× Random Search curve is the Random Search algorithm when it was allowed to sample two points for every single point of the other samplers. While some authors have claimed that 2× Random Search is highly competitive with Bayesian Optimization methods (see, e.g., Li et al. 2016. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. CoRR abs/1603.06560 (2016). http://arxiv.org/abs/1603.06560), the data provided herein suggests this is only true when the dimensionality of the problem is sufficiently high (e.g., over 16).

4.3 Example Transfer Learning Results

The convergence of transfer learning is tested in a 10 dimensional space using the 8 black-box functions described in section 4.1. Up to 180 trials are run and transfer learning is applied every 6 trials using the previous 6 as its prior, such that there are 30 linearly chained studies. Transfer learning is deemed to be converging if the optimality gap shrinks with increasing trials. It is important to note that convergence to the optimum is a difficult task, since each study gets a budget of only 6 trials whilst operating in a 10 dimensional space.

FIG. 7 shows the convergence of two search algorithms, Gaussian Process Bandits and Random Search, by comparing the log of the geometric mean of the optimality gap across all the black box functions. Note that the GP Bandit shows steady progress towards the optimum when compared to Random Search, thus demonstrating the effective transfer of knowledge from the earlier trials. Also, note the saw-tooth pattern as the X axis is traversed, due to transfer learning being applied every 6 trials.
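The aggregate metric plotted in FIG. 7 can be computed as in the following sketch, which uses the identity that the log of a geometric mean equals the arithmetic mean of the logs. The function name is hypothetical, and strictly positive optimality gaps are assumed (a gap of exactly zero would make the logarithm undefined).

```python
import math


def log_geometric_mean(gaps):
    """Log of the geometric mean of strictly positive optimality gaps."""
    return sum(math.log(g) for g in gaps) / len(gaps)
```

Because the metric averages in log space, a benchmark with a large absolute gap (such as Rastrigin) does not dominate one with a small absolute gap.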

In particular, FIG. 7 illustrates the convergence of transfer learning in a 10 dimensional space using the 8 black-box functions described in section 4.1. Transfer learning is applied every 6 trials using the previous 6 as its prior. The X axis shows increasing trials, whereas the Y axis shows the log of the geometric mean of optimality gaps across all the benchmarks. Note that GP bandits show a consistent decline in optimality gap with increasing trials, thus demonstrating effective transfer of knowledge from the earlier trials.

4.4 Example Automated Stopping Results

4.4.1 Example Performance Curve Stopping Rule Results

Through experimentation, the use of the performance curve stopping rule has been shown to achieve optimality gaps comparable to those achieved without the stopping rule, while using approximately 50% fewer CPU-hours when tuning hyperparameters for deep neural networks. This result is in line with figures reported by other researchers, while using a more flexible non-parametric model. For example, Domhan et al. report reductions in the 40% to 60% range on three ML hyperparameter tuning benchmarks (2015. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves. In IJCAI. 3460-3468).

5. EXAMPLE USE CASES

The parameter optimization system of the present disclosure can be used for a number of different application domains.

5.1 Example Hyperparameter Tuning Use Case

The parameter optimization system of the present disclosure can be used to optimize hyperparameters of machine learning models, both for research and production models. One implementation scales to service the entire hyperparameter tuning workload across Alphabet, which is extensive. As one (admittedly extreme) example, the parameter optimization system of the present disclosure has proven capable of performing hyperparameter tuning studies that collectively contain millions of trials. In one example context, a single trial can involve training a distinct machine learning model using different hyperparameter values. This would not be possible without effective black-box optimization. For other research projects, automating the arduous and tedious task of hyperparameter tuning accelerates their progress.

Perhaps even more importantly, the parameter optimization system of the present disclosure has made notable improvements to production models underlying many Google products, resulting in measurably better user experiences for over a billion people.

5.2 Example Automated A/B Testing Use Case

In addition to tuning hyperparameters, the parameter optimization system of the present disclosure can have a number of other uses. It can be used for automated A/B testing of web properties, for example tuning user-interface parameters such as font and thumbnail sizes, color schema, and spacing, or traffic-serving parameters such as the relative importance of various signals in determining which items to show to a user. An example of the latter would be “how should the search results returned from Google Maps trade off search-relevance for distance from the user?”

5.3 Example Physical Design or Logistical Problems Use Case

The parameter optimization system of the present disclosure can also be used to solve complex black-box optimization problems arising from physical design or logistical problems. More particularly, the parameter optimization system can be employed to optimize the adjustable parameters (e.g., component or ingredient type or amount, production order, production timing) of a physical product or process of producing a physical product such as, for example, an alloy, a metamaterial, a concrete mix, a process for pouring concrete, a drug cocktail, or a process for performing therapeutic treatment. Additional example applications include optimization of physical systems (e.g., optimizing airfoils in simulation) or logistical problems.

More generally, the parameter optimization system and associated techniques provided herein can be applied to a wide variety of products, including any system, product, or process that can be specified by, for example, a set of components and/or operating/processing parameters. Thus, in some implementations, the parameter optimization system can be used to perform optimization of products (e.g., personalized products) via automated experimental design.

5.4 Example Additional Capabilities

Additional capabilities of the system can include:

Infeasible trials: In real applications, some trials may be infeasible, meaning they cannot be evaluated for reasons that are intrinsic to the parameter settings. For example, very high learning rates may cause training to diverge, leading to garbage models.

Manual overrides of suggested trials: Sometimes the suggested trial cannot be evaluated or else a different trial might mistakenly be evaluated rather than the one asked for. For example, workflow, component availability, or other reasons can cause the evaluation of a particular suggested trial to be impractical.

The parameter optimization system of the present disclosure can support marking trials as infeasible, in which case they do not receive an objective value. In the case of Bayesian Optimization, previous work can, for example, assign them a particularly bad objective value, attempt to incorporate a probability of infeasibility into the acquisition function to penalize points that are likely to be infeasible (see, e.g., Bernardo et al. 2011. Optimization under unknown constraints. Bayesian Statistics 9 9 (2011), 229), or try to explicitly model the shape of the infeasible region (see, e.g., Gardner et al. 2014. Bayesian Optimization with Inequality Constraints. In ICML. 937-945; and Gelbart et al. 2014. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 250-259).

One example implementation of the present disclosure takes the first approach (assigning infeasible trials a particularly bad objective value), which is simple and fairly effective for the applications considered. Regarding manual overrides, the parameter optimization system of the present disclosure can include a stateless design that enables it to support updating or deleting trials; for instance, the trial state can simply be updated in the database.
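The approach of assigning infeasible trials a particularly bad objective value can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the `Trial` record, the `effective_objectives` helper, and the `penalty_margin` parameter are hypothetical names, minimization is assumed, and every feasible trial is assumed to have been evaluated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    """Hypothetical trial record: parameter values plus evaluation outcome."""
    params: dict
    objective: Optional[float] = None  # None until the trial is evaluated
    infeasible: bool = False           # set when the trial cannot be evaluated


def effective_objectives(trials: List[Trial], penalty_margin: float = 1.0) -> List[float]:
    """Return the objective values handed to a minimizer.

    Infeasible trials receive a value worse (larger) than every feasible
    value observed so far; this is one assumed way to pick a "bad" value.
    """
    feasible = [t.objective for t in trials
                if not t.infeasible and t.objective is not None]
    worst = max(feasible) if feasible else 0.0
    penalty = worst + penalty_margin
    return [penalty if t.infeasible else t.objective for t in trials]


trials = [
    Trial({"learning_rate": 0.1}, objective=0.5),
    Trial({"learning_rate": 10.0}, infeasible=True),  # e.g., training diverged
    Trial({"learning_rate": 0.01}, objective=0.8),
]
values = effective_objectives(trials)
```

Because the penalty is derived from the trials stored so far, updating or deleting a trial and recomputing the values is consistent with a stateless design.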

6. INTRODUCTION TO GRADIENTLESS DESCENT

The present disclosure also provides a novel algorithm for black-box function optimization based on random sampling, which is referred to in some instances as “Gradientless Descent.” The Gradientless Descent algorithm converges exponentially fast under relatively weak conditions and mimics the exponentially fast convergence of gradient descent on strongly convex functions. It has been demonstrated that the algorithm is highly effective in practice, as will be shown with example experimental results below.

It has also been demonstrated that the algorithm performs very well empirically, even on high-dimensional problems. The algorithm is sufficiently fast (constant time per suggested point x_(t)) that it is suitable for applications in which function evaluations are only moderately expensive.

The present disclosure assumes oracle access to (or at least the ability to evaluate within a reasonable margin of error) an objective function $f : \mathcal{X} \to \mathbb{R}$ to be optimized, and beyond that, as few assumptions are made as possible. Without loss of generality, minimization is considered by the discussion provided herein, wherein the goal is to find $x \in \arg\min\{f(x) : x \in \mathcal{X}\}$. However, maximization goals can easily be accomplished with minor changes to the algorithm.

Often, evaluation of ƒ is expensive in one sense or another, so one primary cost metric considered herein is the number of function evaluations.

6.1 Example Related Algorithms

Given the wide applicability of black-box optimization, it is not surprising that it has been extensively studied in many fields. The simplest algorithms include random search and grid search, which select points uniformly at random or from a regular grid, respectively. Simulated annealing has been used for black-box optimization for decades in a variety of fields (see, e.g., Kirkpatrick et al. Optimization by simulated annealing. Science, 220 (New Series) (4598):671-680, 1983; and Brooks and Morgan. Optimization using simulated annealing. Journal of the Royal Statistical Society. Series D (The Statistician), 44(2):241-257, 1995), as have genetic algorithms (see, e.g., Rios and Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247-1293, 2013 and references therein).

Another class of algorithms maintains a local set of points and updates it iteratively. For example, the Nelder-Mead algorithm (Nelder and Mead. A simplex method for function minimization. The computer journal, 7(4):308-313, 1965) maintains a simplex that it updates based on a few simple rules. More modern approaches develop local models, maintaining a trust region where the model is presumed to be accurate, and optimizing the local model within the trust region to select the next point. For broad treatments, see Rios and Sahinidis. Derivative-free optimization: a review of algorithms and comparison of software implementations. Journal of Global Optimization, 56(3):1247-1293, 2013 and references therein; and Conn et al. Introduction to derivative-free optimization. SIAM, 2009.

More recently introduced, Bayesian optimization (BO) algorithms (e.g., Shahriari et al. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148-175, 2016) attempt to model the objective function over the entire feasible space and make the tradeoff between exploration and exploitation explicit (i.e., treating optimization as an infinite-armed bandit problem). Most researchers model the objective using either Gaussian processes (see, e.g., Srinivas et al. Gaussian process optimization in the bandit setting: No regret and experimental design. ICML, 2010; and Snoek et al. Practical Bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951-2959, 2012), deep neural networks (see, e.g., Snoek et al. Scalable Bayesian optimization using deep neural networks. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2171-2180, 2015; and Wilson et al. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pp. 370-378, 2016), or regression trees (see, e.g., Hutter et al. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507-523. Springer, 2011; and Bergstra et al. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pp. 2546-2554, 2011).

All these methods have tradeoffs in terms of speed of convergence, robustness, and scaling to large studies. Despite advances in algorithms for black-box optimization, random search remains popular among practitioners. This is speculated to be because it is easy to understand and implement, is computationally very cheap, and has predictable and dependable (though mediocre) performance. Random search is also immune to pathological objective functions or noise because it is oblivious (i.e., it does not adapt based on the values ƒ(x_(t)) of the points {x_(t)}_(t≥1) it chooses). Of course, this very obliviousness means it cannot exploit any properties of ƒ to converge faster.

Thus, there is a need for an advanced black-box optimization algorithm that is more clever and faster than random search yet retains most of random search's favorable qualities. The present disclosure provides an algorithm for black-box optimization called Gradientless Descent that is fast and easy to implement, performs well in practice, and has interesting convergence guarantees for a wide class of objective functions.

The black-box optimization algorithm of the present disclosure is evaluated empirically against a state-of-the-art implementation of Bayesian Optimization with Gaussian process modeling and demonstrated to outperform the latter when the budget on evaluations is sufficiently large. It is then proven that convergence bounds on the algorithm are analogous to strong bounds for gradient descent.

7. EXAMPLE GRADIENTLESS DESCENT ALGORITHM

An example Gradientless Descent algorithm (Algorithm 2) is provided below. Algorithm 2 is one example algorithm to accomplish certain aspects described herein and can be modified in various ways to produce variants that are within the scope of the present disclosure.

Algorithm 2 uses the following notation: let $d$ denote the dimensionality of the problem, let $s \sim \mathcal{U}(S)$ denote that $s$ is sampled uniformly at random from $S$, and let $\mathrm{diam}(\mathcal{X})$ denote the diameter of $\mathcal{X}$, i.e., $\mathrm{diam}(\mathcal{X}) := \max\{\|x - x'\| : x, x' \in \mathcal{X}\}$.

Algorithm 2 is an iterative algorithm. In some implementations, when generating a new point in round t, it can sample uniformly at random with probability ε, or it can sample from a ball B_(t) of radius r_(t) around the best point seen so far, b_(t−1). The radius r_(t) can be a random sample from a geometric series, which, as can be seen in Section 9, can allow the algorithm to converge rapidly towards a good point.

When ε>0, the algorithm can spend at least part of its time investigating uniformly randomly sampled points. This can be used to handle multiple minima, and can guarantee that the algorithm's worst-case performance cannot be much worse than Random Search.

Algorithm 2 Gradientless Descent
 1: input: Closed convex feasible set $\mathcal{X} \subset \mathbb{R}^d$, objective function $f$ to minimize, uniform-sampling weight $\epsilon \in [0, 1]$, resolution $\delta > 0$, number of points $T \in \mathbb{Z}_+$ to evaluate.
 2: Initialize $t = 1$.
 3: Select $x_1 \sim \mathcal{U}(\mathcal{X})$ and receive feedback $y_1 := f(x_1)$.
 4: Set $b_1 = x_1$.
 5: for $t = 2, \ldots, T$ do
 6:   Sample $p \sim \mathcal{U}([0, 1])$.
 7:   if $p \leq \epsilon$ then
 8:     Select $x_t \sim \mathcal{U}(\mathcal{X})$.
 9:   else
10:     Let $R := \left\{\delta \cdot 2^k : 0 \leq k \leq \log_2\left(\frac{\mathrm{diam}(\mathcal{X})}{\delta\sqrt{d}}\right)\right\}$.
11:     Sample $r_t \sim \mathcal{U}(R)$.
12:     Let $B_t := \{x : \|x - b_{t-1}\| \leq r_t\}$.
13:     Sample $\hat{x} \sim \mathcal{U}(B_t)$.
14:     Select as $x_t$ the projection of $\hat{x}$ onto $\mathcal{X}$.
15:   end if
16:   Receive feedback $y_t := f(x_t)$.
17:   Set $b_t = \arg\min\{f(x) : x \in \{b_{t-1}, x_t\}\}$.
18: end for
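A minimal Python sketch of Algorithm 2 follows, under stated assumptions: the feasible set is taken to be an axis-aligned box (so that projection reduces to clipping and the diameter has a closed form), uniform sampling from a ball is done with the standard normalized-Gaussian construction, and the names `gradientless_descent`, `lower`, `upper`, and `seed` are illustrative rather than part of the disclosure.

```python
import math
import random


def gradientless_descent(f, lower, upper, T, eps=0.0, delta=1e-3, seed=0):
    """Sketch of Algorithm 2 on the box [lower, upper] (an assumed feasible set)."""
    rng = random.Random(seed)
    d = len(lower)
    diam = math.sqrt(sum((u - l) ** 2 for l, u in zip(lower, upper)))
    # R = {delta * 2^k : 0 <= k <= log2(diam / (delta * sqrt(d)))}.
    # Assumption: if the log is negative, fall back to the single radius delta.
    k_max = max(0, math.floor(math.log2(diam / (delta * math.sqrt(d)))))
    radii = [delta * 2 ** k for k in range(k_max + 1)]

    def uniform_point():
        return [rng.uniform(l, u) for l, u in zip(lower, upper)]

    def ball_point(center, r):
        # Uniform in a d-ball: Gaussian direction, radius scaled by U^(1/d).
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        scale = r * rng.random() ** (1.0 / d) / norm
        return [c + scale * v for c, v in zip(center, g)]

    def project(x):
        # Euclidean projection onto a box is coordinate-wise clipping.
        return [min(max(v, l), u) for v, l, u in zip(x, lower, upper)]

    best = uniform_point()            # steps 3-4
    best_y = f(best)
    for _ in range(2, T + 1):         # step 5
        # Steps 6-7; "<" matches "p <= eps" up to a probability-zero event.
        if rng.random() < eps:
            x = uniform_point()       # step 8
        else:
            x = project(ball_point(best, rng.choice(radii)))  # steps 10-14
        y = f(x)                      # step 16
        if y < best_y:                # step 17
            best, best_y = x, y
    return best, best_y


sphere = lambda x: sum(v * v for v in x)
best, best_y = gradientless_descent(sphere, [-5.0] * 4, [5.0] * 4, T=2000)
```

Each suggestion costs O(d) arithmetic and no model fitting, which is why the per-point overhead does not grow with the number of evaluations.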

8. EXAMPLE EMPIRICAL EVALUATION

8.1 Example Methodology

The quality of each algorithm can be assessed based on its optimality gap (the distance of its best-found score from the known optimal score) on selected benchmark functions, such as benchmark functions selected from the 2009 Black-Box Optimization Benchmarking workshop of the Genetic and Evolutionary Computation Conference (GECCO) (Hansen et al. Real-Parameter Black-Box Optimization Benchmarking 2009: Noiseless Functions Definitions. Research Report RR-6829, INRIA, 2009. URL https://hal.inria.fr/inria-00362633). For example, the Beale, Branin, Ellipsoidal, Rastrigin, Rosenbrock, Six Hump Camel, Sphere, and Styblinski benchmark functions or other functions with known optimal solutions designed to test the ability of black-box optimization routines can be used. The quality metric of a given run on a single benchmark is the ratio of the resulting optimality gap to that produced by a Random Search run of the same duration (thus normalizing the values, allowing for comparison across benchmarks with search spaces of differing sizes). This value is averaged over 100 applications (to account for the stochastic nature of most algorithms), and the mean of this normalized value over all benchmarks is taken, resulting in the relative optimality gap of the algorithm.
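The metric described above can be sketched as follows. This is an illustrative reading, not the evaluation code used for the reported results: the function name `relative_optimality_gap`, the dictionary layout, and the pairing of each algorithm run with one Random Search run are assumptions, and the Random Search runs are assumed never to hit the optimum exactly (otherwise the ratio is undefined).

```python
from statistics import mean


def relative_optimality_gap(algo_best, random_best, optimum):
    """Mean over benchmarks of the per-run gap ratio versus Random Search.

    algo_best / random_best: dict mapping benchmark name -> list of
    best-found scores, one per run (minimization is assumed).
    optimum: dict mapping benchmark name -> known optimal score.
    """
    per_benchmark = []
    for name, f_star in optimum.items():
        ratios = [
            (a - f_star) / (r - f_star)  # algorithm gap / Random Search gap
            for a, r in zip(algo_best[name], random_best[name])
        ]
        per_benchmark.append(mean(ratios))  # average over repeated runs
    return mean(per_benchmark)              # average over benchmarks


# Made-up scores, purely to show the shape of the computation.
score = relative_optimality_gap(
    algo_best={"sphere": [0.1, 0.3], "toy": [0.5, 0.6]},
    random_best={"sphere": [0.2, 0.6], "toy": [0.8, 0.8]},
    optimum={"sphere": 0.0, "toy": 0.4},
)
```

A value below 1 indicates that the algorithm's gap is, on average, smaller than that of Random Search at the same budget.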

8.2 Example Results

FIG. 8 shows the average optimality gap of each algorithm relative to Random Search, in problem space dimensions of 4, 8, 16, and 32. The horizontal axis shows the progress of the search in terms of the number of function evaluations, while the vertical axis is the mean relative optimality gap. For this study three algorithms were compared: SMAC (Hutter et al. Sequential model-based optimization for general algorithm configuration. In International Conference on Learning and Intelligent Optimization, pp. 507-523. Springer, 2011), an implementation of Gaussian Process Bandits with a Spectral Kernel (Quiñonero-Candela et al. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(June):1865-1881, 2010), and Gradientless Descent as described herein.

FIG. 8 shows that all three algorithms are clearly superior to Random Search after the burn-in period (which is quite small; note the logarithmic scaling). Further, while Gradientless Descent lags behind the Bayesian Optimization (BO) approach at first, it eventually dominates: the higher the dimensionality of the problem, the earlier the break-even point appears to be.

What is not shown in FIG. 8 is that the BO methods can be considerably more time consuming than Gradientless Descent. While the computational speed of BO may not be critical when evaluating ƒ is extremely expensive, in more moderate cases (especially when the number of points is large) the Bayesian Optimization itself may become as expensive as evaluating the objective function. By comparison, Gradientless Descent is essentially free, requiring only O(1) time per suggested point.

Finally, the 2× Random Search (a version of Random Search that is allowed to sample two points for every one of a normal point) is also plotted. While Li et al. have claimed 2× Random Search is highly competitive (see Hyperband: A novel bandit-based approach to hyperparameter optimization. CoRR, abs/1603.06560, 2016. URL http://arxiv.org/abs/1603.06560), the data suggests this is only true at high dimensions, and moreover Gradientless Descent consistently dominates 2× Random Search after a burn-in period.

9. EXAMPLE ANALYSIS OF CONVERGENCE

In this section, convergence results under some relatively weak assumptions on the problem domain are provided.

9.1 Preliminaries

A function ƒ is called L-Lipschitz if it is Lipschitz continuous with Lipschitz constant L. The level-sets of ƒ are the preimages of the objective function. Formally,

$\mathcal{L}_x := \{x' : f(x') = f(x)\}.$

The optimality gap of a point x is γ(x):=ƒ(x)−ƒ(x*), where x* is any feasible point minimizing ƒ. The optimality gap of a set of points is the minimum optimality gap of its elements.

Definition 1: (β-balanced) Fix $\mathcal{X}$, $\mathcal{X}' \subset \mathcal{X}$ and $x^* \in \mathcal{X}'$. Let $\rho(x, u) := \{x + tu : t \in \mathbb{R}_{\geq 0}\}$ for vector $u \in \mathbb{R}^d$. A function $f : \mathcal{X} \to \mathbb{R}$ is called β-balanced with respect to $x^* \in \mathcal{X}'$ if for all level sets $\mathcal{L}_1 = f^{-1}(y_1)$ and $\mathcal{L}_2 = f^{-1}(y_2)$ with $y_1 > y_2$, with the minimum and maximum below taken over all directions $u \in \mathbb{R}^d$, $\|u\| = 1$,

$\begin{matrix}{\min\limits_{u}\left\{\frac{\|x_1 - x_2\|}{\|x_1 - x^*\|} : x_i \in \rho(x^*, u) \cap \mathcal{L}_i \cap \mathcal{X}'\right\} \geq \beta \max\limits_{u'}\left\{\frac{\|x_1' - x_2'\|}{\|x_1' - x^*\|} : x_i' \in \rho(x^*, u') \cap \mathcal{L}_i \cap \mathcal{X}'\right\}} & (1)\end{matrix}$

As an example, FIG. 9 provides an illustration of β-balancedness for two functions with level-sets shown. For β-balanced functions, ∥d−c∥/∥d−x*∥≥β∥a−b∥/∥a−x*∥ for two level sets and rays originating at x*.

Example 1: Spherical Level Sets

The function $f(x) = \|x\|^p$ for any $p > 0$ has spherical level sets, and is 1-balanced with respect to the optimal $x^* = 0$ in $\mathcal{X}' = \{x : f(x) \leq r^p\}$ for any $r \in \mathbb{R}$.

Example 2: Ellipsoidal Level Sets

Likewise, $f(x) = \|Ax\|^p$ for a positive definite matrix $A$ and $p > 0$ has ellipsoidal level sets and is 1-balanced with respect to the optimal $x^* = 0$ in $\mathcal{X}' = \{x : f(x) \leq r^p\}$.

Example 3: Spherical Level Sets with Constant Velocity

Fix $u \in \mathbb{R}^d$ with $\|u\| = 1$, and any constant $\alpha \in [0, 1)$. A function $f$ with optimum $x^* = 0$ that has level sets $f^{-1}(r) := \{x : f(x) = r\}$ which are spheres of radius $r$ centered at $\alpha r u$ is 1-balanced with respect to $x^*$ in $f^{-1}(r)$ for all $r > 0$.

Given a smooth closed surface S and a point p on that surface, a ball B is considered tangent to S at p if B's surface contains p and B's tangent at p equals that of S. If B is contained in S, it is considered enclosed. Note the largest enclosed tangent sphere at p is in many (but not all) cases equal to the osculating sphere from differential geometry.

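Condition 2: Suppose there exists a closed connected $\mathcal{X}' \subset \mathcal{X}$ with $\mathrm{vol}(\mathcal{X}') \geq \mu \cdot \mathrm{vol}(\mathcal{X})$ for $\mu > 0$ such that:

1. The optimal point $x^* = \arg\min\{f(x) : x \in \mathcal{X}\}$ lies within $\mathcal{X}'$.
2. $f$ is β-balanced with respect to $x^* \in \mathcal{X}'$ for some $\beta > 0$.
3. Every point in $\mathcal{X}'$ is superior to every point outside of it with respect to $f$. That is, since $f$ is being minimized, $f(x') < f(x)$ for all $x' \in \mathcal{X}'$, $x \in \mathcal{X} \setminus \mathcal{X}'$.
4. There exists $\theta > 0$ such that for every $x \in \mathcal{X}'$, the level-sets $\mathcal{L}_x$ admit enclosed tangent balls of radius at least $\theta\|x - x^*\|$ that lie entirely within $\mathcal{X}$.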

The first three sub-conditions on $\mathcal{X}'$ serve to avoid problems with optimal points lying on boundaries or corners of the feasible region.

The final sub-condition, requiring θ>0, ensures that it is not too hard to find directions that improve ƒ, which is particularly important in high dimensions. Loosely speaking, θ>0 implies that the level set must not have corners.

An example of a function that does not satisfy this condition is the $\ell_\infty$ norm of $x$, i.e., $f(x) := \|x\|_\infty = \max_i |x_i|$. If $x_i = 1$ for all $i$, then only when all coordinates are reduced does the objective decrease; hence a random direction has only a $2^{-d}$ probability of decreasing the objective from $x$.
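The $2^{-d}$ figure for the $\ell_\infty$ example can be checked empirically. The following Monte Carlo sketch is only an illustration; the function name `fraction_of_descent_directions`, the step size, and the sample count are assumptions chosen for the demonstration.

```python
import random


def fraction_of_descent_directions(d, samples=100_000, step=1e-3, seed=0):
    """Estimate how often a random unit direction decreases max_i |x_i|
    when starting from the all-ones point in d dimensions."""
    rng = random.Random(seed)
    x = [1.0] * d
    linf = lambda v: max(abs(c) for c in v)
    hits = 0
    for _ in range(samples):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]  # isotropic direction
        norm = sum(c * c for c in u) ** 0.5
        y = [xi + step * ui / norm for xi, ui in zip(x, u)]
        hits += linf(y) < linf(x)  # true only if every coordinate decreased
    return hits / samples


estimate = fraction_of_descent_directions(d=3)  # compare with 2**-3 = 0.125
```

For d=3 the estimate should land close to 0.125, and the fraction halves with every added dimension, which is why corners in the level sets make random directions ineffective.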

9.2 Example Convergence with a Single Basin of Attraction

A convergence result for Gradientless Descent under some simplifying assumptions is initially provided, and then generalized. In particular, treatment of multiple local minima is temporarily deferred.

Recall a sublevel set is the set of points with ƒ(x)≤y for some y. In particular, it is assumed that the feasible region $\mathcal{X}$ for the objective ƒ is itself a sublevel set containing a single basin of attraction, meaning its sublevel sets are connected.

For simplicity, it is assumed that uniform sampling is given zero weight, i.e., ε=0, and that all points considered lie within a single basin of attraction. Multiple local optima will be addressed later in this section.

Theorem 3: Let $f$ be a smooth L-Lipschitz objective function satisfying Condition 2 on feasible region $\mathcal{X} \subset \mathbb{R}^d$ with a single basin of attraction in $\mathcal{X}$, and moreover suppose $\mathcal{X}$ is a sublevel set of $f$. Let $\{x_t\}_{t \geq 1}$ be the points selected by Gradientless Descent with ε=0. Then there exists an absolute constant $\rho > 0$ such that for all $\eta > 0$, the optimality gap $\gamma_T$ after $T$ steps satisfies:

$\Pr\left[\gamma_T \geq L\max\left(\frac{5\delta\sqrt{d}}{2\theta}, \mathrm{diam}(\mathcal{X})\exp(-\Lambda)\right)\right] \leq \eta$

where

$\Lambda := \frac{(T - 1)\rho\beta\theta}{5d \cdot \log_2\left(\frac{\mathrm{diam}(\mathcal{X})}{\delta\sqrt{d}}\right)} - \sqrt{2(T - 1)\frac{\beta\theta}{5d}\ln(2/\eta)}.$

In other words, the optimality gap shrinks at least exponentially fast, with an exponent of:

$\frac{\beta\theta T}{5{d \cdot {\log_{2}\left( \frac{{diam}(X)}{\delta\sqrt{d}} \right)}}},$

until it gets within distance

$\frac{5\delta\sqrt{d}}{2\theta}$

of the optimum (where δ is a user-chosen parameter that may be arbitrarily small).

Note that while ρ is hard to calculate, it is not impractically small; in Lemma 6 it is argued that it converges to a value above 0.158 in the limit of infinite dimensions.
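To get a feel for the bound in Theorem 3, it can be evaluated numerically. The sketch below simply transcribes the expression for Λ and the resulting high-probability bound; the function name `theorem3_bound` and the example parameter values are hypothetical, and using ρ=0.158 (the limiting value mentioned above) at finite dimension is an assumption made purely for illustration.

```python
import math


def theorem3_bound(T, d, L, beta, theta, delta, diam, eta, rho=0.158):
    """Value g such that Theorem 3 states Pr[gamma_T >= g] <= eta.

    rho=0.158 is an assumed stand-in for the absolute constant rho.
    """
    log_term = math.log2(diam / (delta * math.sqrt(d)))
    c = beta * theta / (5.0 * d)
    lam = ((T - 1) * rho * c / log_term
           - math.sqrt(2.0 * (T - 1) * c * math.log(2.0 / eta)))
    floor = 5.0 * delta * math.sqrt(d) / (2.0 * theta)  # resolution-limited term
    return L * max(floor, diam * math.exp(-lam))


params = dict(d=4, L=1.0, beta=1.0, theta=1.0, delta=1e-3, diam=20.0, eta=0.1)
early = theorem3_bound(T=101, **params)         # Lambda < 0: bound is vacuous
late = theorem3_bound(T=10_000_001, **params)   # bound has reached its floor
```

With these illustrative values the bound is vacuous (larger than the diameter) for small T, because the concentration term dominates, and it settles at the resolution-limited floor $5L\delta\sqrt{d}/(2\theta)$ once T is large. This reflects how conservative the worst-case analysis is, not the typical behavior of the algorithm.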

Proof. Define the following potential function:

$\Phi(x) := \ln(\max\{\|x' - x^*\| : x' \in \mathcal{L}_x\}).$

Since ƒ is L-Lipschitz continuous by assumption, γ(x)≤L∥x−x*∥≤L exp(Φ(x)), so bounds on Φ translate to bounds on the optimality gap. By assumption, ε=0, so x₁ is chosen uniformly at random, and all subsequent points x_(t) are selected via sampling from a ball around b_(t−1) in step t (this is referred to herein as ball-sampling).

Define Φ_(t):=Φ(b_(t)), and consider the random variable Δ_(t):=Φ_(t)−Φ_(t−1). Because of the arg min in Algorithm 2 and because the level sets of ƒ are nested, it is trivially true that Φ is nonincreasing (i.e., Δ_(t)≤0). However, to show actual progress, high-probability upper bounds on Σ_(t)Δ_(t) are developed, ideally of the form Σ_(t=2)^(T)Δ_(t)≤−c(T−1) for some constant c.

FIG. 10 shows an event $\varepsilon_t$ that corresponds to “significant” progress in reducing $\Phi_t$; $\varepsilon_t$ is defined as follows. Let $b_{t-1}$ be the best point seen so far, and let $x^*$ be the optimum. Let $q = (1 - v)b_{t-1} + vx^*$ for some $v \in (0, 1)$ to be determined, and let $B_q$ be an enclosed tangent ball to $\mathcal{L}_q$ at $q$, of radius $\theta\|q - x^*\| = \theta(1 - v)\|b_{t-1} - x^*\|$ as guaranteed by Condition 2. Then $\varepsilon_t$ is the event that the sample $x_t \in B_q$. Thus, with reference to FIG. 10, the significant progress event $\varepsilon_t$ means sampling a point in the shaded region.

Let $c_q$ be the center of $B_q$, and let $\ell = \|b_{t-1} - c_q\|$. Suppose the sampling ball radius $r_t$ lies in $[\ell/(2\sqrt{d}), \ell/\sqrt{d}]$. This event is denoted by $\mathcal{R}_t$; $\Pr[\varepsilon_t \mid \mathcal{R}_t]$ is now bounded using Lemma 6. If the radius $r_q$ of $B_q$ satisfies

$r_q \geq \left(1 - \frac{1}{4d}\right)\ell,$

then scaling all distances by $1/\ell$ and applying Lemma 6 establishes that the fraction of the sampling ball's volume that lies in $B_q$ is $\Omega(1)$, i.e., $\mathrm{vol}(B_t \cap B_q) = \Omega(\mathrm{vol}(B_t))$.

It is now shown that $v = \theta/5d$ is sufficient to ensure $r_q$ is large enough. Define $\mathrm{dist}(x, S) := \min\{\mathrm{dist}(x, s) : s \in S\}$. Note

$r_q \geq \left(1 - \frac{1}{4d}\right)\ell$

if

$\mathrm{dist}(b_{t-1}, B_q) \leq \frac{\ell}{4d},$

since $\ell = \mathrm{dist}(b_{t-1}, B_q) + r_q$. Next note $\mathrm{dist}(b_{t-1}, B_q) \leq \|b_{t-1} - q\| = v\|b_{t-1} - x^*\|$. Hence it is sufficient to prove

$\begin{matrix}{v\|b_{t-1} - x^*\| \leq \frac{\ell}{4d}} & (2)\end{matrix}$

However,

$\begin{matrix}{\ell \geq \theta\|q - x^*\| = \theta(1 - v)\|b_{t-1} - x^*\|} & (3)\end{matrix}$

Therefore Equation (2) holds when

$v \leq \frac{\theta(1 - v)}{4d},$

which one can easily confirm for $v = \theta/5d$ after noting that $\theta \in [0, 1]$ and $d \geq 1$ (for this choice the inequality reduces to $v \leq 1/5$). Hence $\Pr[\varepsilon_t \mid \mathcal{R}_t] = \Omega(1)$ when $v = \theta/5d$.

Next it can be proven that $\varepsilon_t$ implies a significant decrease in Φ, i.e., that $|\Delta_t|$ is large. Consider that if $\varepsilon_t$ occurs, then $x_t$ must lie in a level set $\mathcal{L}'$ at least as good as $\mathcal{L}_q$, and $\mathcal{L}'$ must contain a point $q'$ in the convex hull of $\{b_{t-1}, x^*\}$ at least distance

$v\|b_{t-1} - x^*\| = \frac{\theta}{5d}\|b_{t-1} - x^*\|$

from $b_{t-1}$. Refer to FIG. 11 for an illustration. In particular, FIG. 11 provides an illustration of the ball sampling analysis. When going from $b_{t-1}$ to $x_t$, the potential Φ drops from $\log\|z(b_{t-1}) - x^*\|$ to $\log\|z(x_t) - x^*\|$.

Here there is a direction $u$ such that with respect to that direction, $\mathcal{L}'$ is at least a

$1 - \frac{\theta}{5d}$

fraction closer to $x^*$ than is $b_{t-1}$. Hence by Condition 2 and the definition of β-balancedness, in every direction $u$ moving from $\rho(x^*, u) \cap \mathcal{L}_{b_{t-1}}$ to $\rho(x^*, u) \cap \mathcal{L}'$ results in being at least a

$1 - \frac{\beta\theta}{5d}$

fraction closer to $x^*$.

Next, it is shown that this implies that $t \mapsto \exp(\Phi_t)$ shrinks by at least a

$1 - \frac{\beta\theta}{5d}$

fraction. Suppose $z(x) \in \arg\max\{\|x' - x^*\| : x' \in \mathcal{L}_x\}$, and consider the direction $u$ proportional to $z(x_t) - x^*$. Because of the β-balancedness of $f$, the point $w \in \rho(x^*, u) \cap \mathcal{L}_{b_{t-1}}$ cannot be too close to $z(x_t)$. In particular:

$\begin{matrix}{1 - \|z(x_t) - x^*\| / \|w - x^*\| \geq \frac{\beta\theta}{5d}} & (4)\end{matrix}$

Clearly, $\|z(b_{t-1}) - x^*\| \geq \|w - x^*\|$ by construction, so by substitution of Equation 4 and algebra, it follows that

$\|z(x_t) - x^*\| \leq \left(1 - \frac{\beta\theta}{5d}\right)\|z(b_{t-1}) - x^*\|$

Thus

$\Delta_t \leq \ln\left(1 - \frac{\beta\theta}{5d}\right)$

when $\varepsilon_t$ occurs. Since $1 + z \leq e^z$ for all $z \in \mathbb{R}$, setting

$z = -\frac{\beta\theta}{5d}$

it is inferred that

$\Delta_t \leq -\frac{\beta\theta}{5d}$

conditioned on $\varepsilon_t$.

By time T, if

$\gamma_T \leq \frac{5L\delta\sqrt{d}}{2\theta}$

there is nothing more to prove. Otherwise, for all $t = 1, 2, \ldots, T$, $b_t$ must be at distance at least

$\frac{5\delta\sqrt{d}}{2\theta}$

from any optimal point by the Lipschitz assumption on $f$. This distance is sufficiently large that the algorithm has a chance to select a radius that ensures $\mathcal{R}_t$, since with Equation (3) it guarantees $\delta \leq \ell/(2\sqrt{d})$. Hence

$\Pr[\mathcal{R}_t] = 1/\log_2\left(\frac{\mathrm{diam}(\mathcal{X})}{\delta\sqrt{d}}\right)$

for all $2 \leq t \leq T$.

Now consider $\Phi_T = \Phi_1 + \sum_{t=2}^{T}\Delta_t$. Clearly, $\Phi_1 \leq \ln(\mathrm{diam}(\mathcal{X}))$. A probabilistic upper bound is desired on $\Delta_{2:T} := \sum_{t=2}^{T}\Delta_t$, ideally of $-c(T - 1)$ for some $c > 0$.

Let

$\hat{\Delta}_t := \max\left(\Delta_t, -\frac{\beta\theta}{5d}\right).$

Clearly, $\sum_{t=2}^{T}\Delta_t \leq \sum_{t=2}^{T}\hat{\Delta}_t$, and for all $t$, $\hat{\Delta}_t \in \left[-\frac{\beta\theta}{5d}, 0\right]$ and $\hat{\Delta}_t = -\frac{\beta\theta}{5d}$ conditioned on $\varepsilon_t$. Let $\rho_t = \Pr[\varepsilon_t \mid \mathcal{R}_t]$ and recall it is a positive constant. Also, for all $t$,

$\Pr[\varepsilon_t] \geq \Pr[\varepsilon_t \mid \mathcal{R}_t]\Pr[\mathcal{R}_t] = \rho_t / \log_2\left(\frac{\mathrm{diam}(\mathcal{X})}{\delta\sqrt{d}}\right).$

Azuma's inequality can then be applied on $\hat{\Delta}_{2:T} := \sum_{t=2}^{T}\hat{\Delta}_t$ to prove

$\begin{matrix}{\Pr\left[\left|\hat{\Delta}_{2:T} - \mathbb{E}\left[\hat{\Delta}_{2:T}\right]\right| > \alpha\right] \leq 2\exp\left(\frac{-\alpha^2 \cdot 5d}{2(T - 1)\beta\theta}\right)} & (5)\end{matrix}$

To ensure this probability is at most η, it suffices to set

$\alpha = {{{\alpha(\eta)}:} = {\sqrt{2\left( {T - 1} \right)\frac{\beta\theta}{5d}{\ln\left( {2/\eta} \right)}}.}}$
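As a worked step, the stated choice of α can be recovered by inverting the right-hand side of Equation (5) exactly as written; this is only a rearrangement and introduces no new assumptions:

```latex
\begin{align*}
2\exp\!\left(\frac{-\alpha^{2}\cdot 5d}{2(T-1)\beta\theta}\right) = \eta
&\iff \frac{\alpha^{2}\cdot 5d}{2(T-1)\beta\theta} = \ln(2/\eta) \\
&\iff \alpha^{2} = 2(T-1)\,\frac{\beta\theta}{5d}\,\ln(2/\eta) \\
&\iff \alpha = \sqrt{2(T-1)\,\frac{\beta\theta}{5d}\,\ln(2/\eta)} \qquad (\alpha > 0).
\end{align*}
```

Any larger α makes the right-hand side of Equation (5) smaller than η, so this is the smallest deviation for which the bound holds with probability at least 1−η.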

Moreover,

${{\mathbb{E}}\left\lbrack {\overset{\hat{}}{\Delta}}_{2:T} \right\rbrack} \leq {- {\sum\limits_{t = 2}^{T}{\frac{\beta\theta}{5d}\rho_{t}/{\log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)}}}} \leq {- \frac{{\rho\left( {T - 1} \right)}\beta\theta}{5{d \cdot {\log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)}}}}$

where ρ=min_(t)ρ_(t) is the minimum probability of significant progress being made in any round when the appropriate ball-sampling radius is chosen. Since Δ_(t)≤{circumflex over (Δ)}_(t) with certainty, the result is the corresponding bound for Δ_(2:T). That is, with α(η) defined as above, for all η>0

$\begin{matrix}{{\Pr\left\lbrack {\Phi_{T} \geq {{\ln\left( {{diam}(\mathcal{X})} \right)} - \frac{\left( {T - 1} \right){\rho\beta\theta}}{5{d \cdot \log_{2}}\frac{{diam}(\mathcal{X})}{\delta\sqrt{d}}} + {\alpha(\eta)}}} \right\rbrack} \leq \eta} & (6)\end{matrix}$

As mentioned, the optimality gap satisfies $\gamma(x) \leq L\|x - x^*\| \leq L\exp(\Phi(x))$, so for all $\eta > 0$

$\begin{matrix}{\Pr\left[\gamma_T \geq L \cdot \mathrm{diam}(\mathcal{X}) \cdot \exp(-\Lambda)\right] \leq \eta} & (7)\end{matrix}$

where Λ is as defined in the theorem statement.

Recall this argument only holds if $\gamma_T > 5L\delta\sqrt{d}/(2\theta)$, so a maximum is introduced over the bound in Eq. (7) and $5L\delta\sqrt{d}/(2\theta)$ to complete the proof.

9.3 A Concrete Example: Ellipsoidal Level Sets

To make Theorem 3 more concrete, a corollary is proven that formalizes the following result: if a function has a sufficiently large region around the global optimum x* in which ƒ(x)=ψ(∥x−x*∥) for some strictly monotonically increasing function $\psi : \mathbb{R} \to \mathbb{R}$ and some scaled Euclidean norm ∥⋅∥, the convergence is still exponentially fast, and only mildly depends on how extreme the distance scaling is, even in arbitrarily high dimension.

Specifically, a lower bound is provided on θ in Condition 2 for d-dimensional ellipsoids. For an ellipsoid $E := \{x \in \mathbb{R}^d : (x - c)^T M (x - c) \leq 1\}$, κ(E) is defined to be the condition number of the matrix M. Hence, it is the square of the ratio of the maximum principal axis length to the minimum.

Theorem 4: An ellipsoid E in arbitrary dimension d has maximum principal curvature everywhere at most 2κ(E)/diam(E).

Proof. Assume that the ellipsoid E is not degenerate, else κ(E)=∞ and there is nothing to prove. Since curvature is invariant to translations and rotations, it can be assumed without loss of generality that E is centered at the origin, and that the coordinate system has basis vectors along the principal axes of E. Hence, there exists $A = \mathrm{diag}(a_1, a_2, \ldots, a_d)$ such that $M = A^T A$, and $a_1 \leq a_2 \leq \ldots \leq a_d$. Define $f(x) := x^T M x = \sum_i a_i^2 x_i^2$, and $\mathcal{L}_x := \{x' : f(x') = f(x)\}$.

The curvature is bounded at an arbitrary point p as follows: Let

$g(\varepsilon) := -\varepsilon\frac{\nabla f(p)}{\|\nabla f(p)\|},$

let $q(\varepsilon) := p + g(\varepsilon)$, and find the vector $v(\varepsilon) \in \mathbb{R}^d$ to be orthogonal to $\nabla f(p)$ and of minimum length such that $q(\varepsilon) + v(\varepsilon) \in \mathcal{L}_p$. Now consider the largest radius circle $C(\varepsilon)$ in the plane that spans $\{\nabla f(p), v(\varepsilon)\}$ that contains $p$, $q(\varepsilon) + v(\varepsilon)$, and $q(\varepsilon) - v(\varepsilon)$. Let its radius be $r(\varepsilon)$. The maximum principal curvature at $p$ is then at most $1/\lim_{\varepsilon \to 0} r(\varepsilon)$, because in the limit $C(\varepsilon)$ approaches the osculating circle of the curve created by intersecting E with the normal hyperplane.

To begin, fix p∈E and take the Taylor series approximation of ƒ at p:

$\begin{matrix}{{f\left( {p + \Delta} \right)} \approx {{f(p)} + {\Delta^{T}{\nabla{f(p)}}} + {\frac{1}{2}\Delta^{T}{H_{f}(p)}\Delta}}} & (8)\end{matrix}$

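where $H_f(p)$ is the Hessian of $f$ at $p$.

Note $\nabla f(p) = [2a_1^2 p_1, 2a_2^2 p_2, \ldots, 2a_d^2 p_d]$ and $H_f(p) = \mathrm{diag}(2a_1^2, 2a_2^2, \ldots, 2a_d^2)$. Since $g = g(\varepsilon)$ is proportional to $\nabla f(p)$, then:

$\begin{matrix}{f(p + g) = f(p) - \varepsilon\|\nabla f(p)\| + \mathcal{O}(\varepsilon^2)} & (9)\end{matrix}$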

Since v is orthogonal to ∇ƒ(p), v^(T)∇ƒ(p)=0, then:

ƒ(p+v)≈ƒ(p)+Σ_(i) a _(i) ² v _(i) ²  (10)

In the limit as $\varepsilon \to 0$, the approximation

$\begin{matrix}{f(p + g + v) \approx f(p) + (f(p + g) - f(p)) + (f(p + v) - f(p))} & (11)\end{matrix}$

can be used since any cross terms only contribute $\mathcal{O}(\varepsilon^2)$, while the terms in Eq. 11 are all $\Omega(\varepsilon)$, as will be demonstrated.

Recall the aim is to find the minimum length $v(\varepsilon)$ orthogonal to $\nabla f(p)$ such that $\lim_{\varepsilon \to 0}(f(p + g(\varepsilon) + v(\varepsilon)) - f(p))/\varepsilon = 0$. A trivial lower bound on this is the minimum length $v(\varepsilon)$ of any vector such that $\lim_{\varepsilon \to 0}(f(p + g(\varepsilon) + v(\varepsilon)) - f(p))/\varepsilon = 0$. Equation (11) is then combined with Equations (9) and (10) (ignoring $\mathcal{O}(\varepsilon^2)$ terms) to obtain the following optimization problem:

min{∥v∥:Σ _(i) a _(i) ² v _(i) ²=ε∥∇ƒ(p)∥}.  (12)

Equivalently, ∥v∥² can be minimized instead. Rewriting the resulting optimization problem using w_(i):=a_(i) ²v_(i) ² yields the following problem with the same optimum:

min{Σ_(i) w _(i) /a _(i) ² :w≥0 and Σ_(i) w _(i)=ε∥∇ƒ(p)∥}  (13)

Let $\hat{e}_i$ be the $i$-th unit basis vector. It is straightforward to see the optimum solution is $w^* = \varepsilon\|\nabla f(p)\|\hat{e}_d$, since if $w_i > 0$ for $i \neq d$, $w_i\hat{e}_d - w_i\hat{e}_i$ could be added to $w$ to generate a solution which is at least as good (specifically, the objective changes by $w_i(a_d^{-2} - a_i^{-2})$, which is nonpositive since $a_i \leq a_d$ by assumption). The optimization from Equation (12) thus has an optimum $v^*$ with $v^*_d = \sqrt{\varepsilon\|\nabla f(p)\|}/a_d$ and $v^*_i = 0$ for all $i \neq d$. Hence $\|v^*\| = \sqrt{\varepsilon\|\nabla f(p)\|}/a_d$.

Next, note that for a circle of radius $r$ centered at the origin, the line $x_1 = r - \varepsilon$ passes through $x_2 = \sqrt{r^2 - (r - \varepsilon)^2} = \sqrt{2r\varepsilon - \varepsilon^2}$. Thus, the osculating circle through $p$ has radius $\lim_{\varepsilon \to 0} r(\varepsilon)$, where $r := r(\varepsilon)$ satisfies $\sqrt{2r\varepsilon - \varepsilon^2} \geq \|v^*\|$. That is,

$\begin{matrix}{\sqrt{2r\varepsilon - \varepsilon^2} \geq \sqrt{\varepsilon\|\nabla f(p)\|}/a_d} & (14)\end{matrix}$

which reduces to

$\begin{matrix}{r \geq {\frac{1}{2}{\left( {\frac{{\nabla{f(p)}}}{a_{d}^{2}} + \varepsilon} \right).}}} & (15)\end{matrix}$

It is next proven that $\|\nabla f(p)\| \geq 2a_1$ for all $p \in E$. Since $p \in E$, then $p^T\nabla f(p) = 2\sum_i a_i^2 p_i^2 = 2$. By the Cauchy-Schwarz inequality, $p^T\nabla f(p) \leq \|p\|\|\nabla f(p)\|$, hence $\|\nabla f(p)\| \geq 2/\|p\|$. It is easy to verify that since E is centered at the origin and its principal axes are aligned with the coordinate axes, $\max\{\|p\| : p \in E\} = 1/a_1$, attained by $p = \hat{e}_1/a_1$. Therefore $\|\nabla f(p)\| \geq 2a_1$, and it is concluded that

$\begin{matrix}{r \geq {\frac{1}{2}\left( {\frac{{\nabla{f(p)}}}{a_{d}^{2}} + \varepsilon} \right)} \geq \frac{a_{1}}{a_{d}^{2}}} & (16)\end{matrix}$

The maximum principal curvature at any point p∈E is the reciprocal of the minimum possible radius of an osculating circle in any plane normal to E at p. Hence the maximum principal curvature over all E is at most $a_d^2/a_1$. Since diam(E)=2/a₁ and

${{\kappa(E)} = \frac{a_{d}^{2}}{a_{1}^{2}}},$

this equals the claimed bound of 2κ(E)/diam(E).
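The bound of Theorem 4 can be sanity-checked in two dimensions, where the curvature of an ellipse has a closed form. The sketch below is illustrative only: the helper names are hypothetical, and the parametric curvature formula $\kappa(t) = AB/(A^2\sin^2 t + B^2\cos^2 t)^{3/2}$ for semi-axes $A = 1/a_1$ and $B = 1/a_2$ is a standard result that does not come from the present disclosure.

```python
import math


def max_curvature_2d(a1, a2, n=10_000):
    """Largest curvature of {x : a1^2 x1^2 + a2^2 x2^2 = 1}, found by scanning
    the standard parametrization (A cos t, B sin t) with A=1/a1, B=1/a2."""
    A, B = 1.0 / a1, 1.0 / a2
    best = 0.0
    for i in range(n):
        t = 2.0 * math.pi * i / n
        k = A * B / (A**2 * math.sin(t)**2 + B**2 * math.cos(t)**2) ** 1.5
        best = max(best, k)
    return best


def theorem4_bound(a1, a2):
    """2 * kappa(E) / diam(E) for M = diag(a1^2, a2^2) with a1 <= a2."""
    kappa = (a2 / a1) ** 2  # condition number of M
    diam = 2.0 / a1         # length of the longest axis
    return 2.0 * kappa / diam


curv = max_curvature_2d(0.5, 2.0)
bound = theorem4_bound(0.5, 2.0)
```

In this example both quantities equal $a_2^2/a_1 = 8$, so the bound is attained at the ends of the long axis, where the ellipse bends most sharply.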

Corollary 5: Fix any $\mathcal{X}$ and objective function $f$ that satisfies Condition 2 for some constants μ, θ with $\mathcal{X}'$ equal to an ellipsoid E. Suppose $E := \{x \in \mathbb{R}^d : (x - c)^T M (x - c) \leq 1\}$ for some $c$ and $M$, and suppose

$\mathcal{L}_x = \{x' : (x' - c)^T M (x' - c) = (x - c)^T M (x - c)\}$

Then Condition 2 holds for θ=1/κ(E).

Lemma 6: Let $B_1$ and $B_2$ be two balls in $\mathbb{R}^d$ of radii $r_1$ and $r_2$ respectively whose centers are unit-distance apart. If

$r_1 \in \left[\frac{1}{2\sqrt{d}}, \frac{1}{\sqrt{d}}\right] \text{ and } r_2 \geq 1 - \frac{1}{4d},$

then

$\mathrm{vol}(B_1 \cap B_2) = \Omega(\mathrm{vol}(B_1))$

Proof. The intersection B₁∩B₂ is composed of two hyperspherical capsglued end to end. A lower bound is placed on vol(B₁∩B₂) by the volume ofthe cap C₁ of B₁. If B₁ is centered at the origin, and B₂ at ê₁ whereê_(i) is the i^(th) unit basis vector, then this cap is {x:x∈B₁,x₁≥c₁},where c₁ is the cap base height. From classic geometry,

$\begin{matrix}{{c_{1} = {\frac{1}{2}\left( {1 + r_{1}^{2} - r_{2}^{2}} \right)}}.} & (17)\end{matrix}$

For small values of d, it suffices to observe that r₁>1−r₂ so the ballshave an intersection with non-negligible volume.

For larger values of d, it is known that if “slices” are taken through aball of radius r centered at the origin in d dimensions, the volume ofthe slices varies approximately as a normal distribution

${N\left( {0,\frac{r^{2}}{d - 1}} \right)}.$

For a textbook treatment, see Ball, Keith. An elementary introduction to modern convex geometry. Flavors of geometry, 31:1-58, 1997. URL http://library.msri.org/books/Book31/files/ball.pdf. Hence up to constants, vol(C₁) equals the probability that a random draw from that normal distribution exceeds c₁. Thus if c₁≤r₁/√(d−1), this probability is at least ≈0.1586 (from the cumulative distribution function of the standard normal).

Suppose

$r_{1} = \frac{\alpha}{2\sqrt{d}} \text{ and } r_{2} = 1 - \frac{1}{4d}.$

It will be shown that c₁≤r₁/√d for all α∈[1,2], or equivalently, r₁/√d−c₁≥0. From equation (17),

$c_{1} = {\frac{1}{4d} + \frac{\alpha^{2}}{8d} - \frac{1}{32d^{2}}}$

and r₁/√d=α/(2d). Hence r₁/√d−c₁≥0 iff

${{\frac{1}{8d}\left( {{4\alpha} - 2 - \alpha^{2} + \frac{1}{4d}} \right)} \geq 0};$

this holds for all α∈[1,2] since it holds for α=1 and α=2, and α↦4α−2−α² is concave so its value at any α∈[1,2] is lower bounded by a convex combination of its values at α=1 and α=2.

For

${r_{2} > {1 - \frac{1}{4d}}},$

note that increasing the radius of B₂ only increases vol(B₁∩B₂), simply because vol(B₁∩B′)≤vol(B₁∩B) whenever B′⊂B.
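The algebra in the proof of Lemma 6 can be spot-checked numerically. The sketch below evaluates the cap base height of equation (17) for r₁=α/(2√d) and r₂=1−1/(4d), and compares r₁/√d−c₁ against the closed form derived above. The function names are illustrative only and are not part of the disclosure.

```python
import math


def cap_base_height(r1, r2):
    """Equation (17): first coordinate of the plane in which the two spheres meet."""
    return 0.5 * (1.0 + r1 * r1 - r2 * r2)


def slack(d, alpha):
    """r1/sqrt(d) - c1 for r1 = alpha/(2 sqrt(d)) and r2 = 1 - 1/(4d)."""
    r1 = alpha / (2.0 * math.sqrt(d))
    r2 = 1.0 - 1.0 / (4.0 * d)
    return r1 / math.sqrt(d) - cap_base_height(r1, r2)


def closed_form(d, alpha):
    """The expression (1/(8d)) * (4*alpha - 2 - alpha^2 + 1/(4d)) from the proof."""
    return (4.0 * alpha - 2.0 - alpha * alpha + 1.0 / (4.0 * d)) / (8.0 * d)


# Evaluate on a grid of dimensions and of alpha values in [1, 2].
grid = [(d, 1.0 + k / 20.0) for d in (1, 2, 5, 50, 1000) for k in range(21)]
worst_slack = min(slack(d, alpha) for d, alpha in grid)
worst_mismatch = max(abs(slack(d, alpha) - closed_form(d, alpha)) for d, alpha in grid)
```

On this grid the slack stays non-negative and agrees with the closed form up to floating-point rounding, which matches the claim that c₁≤r₁/√d for all α∈[1,2].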

9.4 A More General Example Convergence Result

In this section the results of section 9.2 are extended to functions with multiple local optima. In this case either uniform random samples or ball-samples may cause some b_(t) to be in a different basin of attraction than b_(t−1), i.e., the sublevel set {x:ƒ(x)≤ƒ(b_(t−1))} might not contain a path from b_(t−1) to b_(t).

The convergence result is generalized by considering the largest connected sublevel set 𝒳′ of ƒ that is contained entirely in the feasible region 𝒳. So, once a point in 𝒳′ is selected, all future iterations will remain within it. That is, x_(τ)∈𝒳′ implies b_(t)∈𝒳′ for all t≥τ. From then on, the analysis from the proof of Theorem 3 may be applied, with the caveat that ball-sampling is performed only with probability 1−ε. Recall μ=vol(𝒳′)/vol(𝒳).

Theorem 7: Let 𝒳⊂ℝ^(d) be a closed connected set and let 𝒳′ be the largest connected sublevel set of ƒ, where ƒ is a smooth L-Lipschitz objective function satisfying Condition 2. Then if {x_(t)}_(t≥1) are the points selected by Gradientless Descent, the optimality gap γ_(T) after T steps satisfies, for any η∈(0,1)

$\begin{matrix}{\Pr\left\lbrack \gamma_{T} \geq L\max\left( \frac{5\delta\sqrt{d}}{2\theta},\ {diam}(\mathcal{X}) \cdot \exp\left( - \hat{\Lambda} \right) \right) \right\rbrack \leq \eta} & (18)\end{matrix}$

where

$\hat{\Lambda} := \frac{\rho\hat{S}\beta\theta}{5d \cdot \log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)} - \sqrt{2\hat{S}\frac{\beta\theta}{5d}\ln\left( 6/\eta \right)},$

where ρ>0 is the absolute constant from Theorem 3, and

$\hat{S} := \hat{S}\left( T,\varepsilon,\eta \right) = \left( 1 - \varepsilon \right)\left( T - \tau - 1 \right) - \sqrt{\left( \left( T - \tau - 1 \right)/2 \right)\ln\left( 3/\eta \right)} \quad \text{for} \quad \tau = \frac{\ln\left( 2/\eta \right)}{\varepsilon\mu}$

is a probabilistic lower bound on the number of ball-sampling rounds after sampling a point in 𝒳′ that holds with probability at least 1−η/3.

Proof. For any τ, observe that Pr[b_(τ)∉𝒳′]≤(1−εμ)^(τ). Hence to ensure this probability is below η/3, it is sufficient to set

$\tau = {{\log\left( {3/\eta} \right)}/{{\log\left( \frac{1}{1 - {\varepsilon\mu}} \right)}.}}$

Note that

${\ln\left( \frac{1}{1 - x} \right)} \geq x$

for all x∈[0,1), so

${{\ln\left( \frac{1}{1 - {\varepsilon\mu}} \right)} \geq {\varepsilon\mu}},$

so to ensure the miss probability is at most η, it is sufficient to set

$\tau = {\frac{\ln\left( {2/\eta} \right)}{\varepsilon\mu}.}$

Next, the number of ball-samples S taken in steps t∈(τ,T] is considered (as opposed to uniform random samples, i.e., S is |{t:t∈(τ,T], x_(t)˜𝒰(B_(t)) for some ball B_(t)}|). Note S is distributed as a binomial distribution B(T−τ−1,1−ε), so by Chernoff's inequality Pr[S<𝔼[S]−α]≤exp(−2α²/(T−τ−1)), for all α≥0. To ensure this probability is at most η/3 it is sufficient to set α=√(((T−τ−1)/2)ln(3/η)).

Substituting for Ŝ(T,ε,η) and τ, it is found that Pr[S<Ŝ(T,ε,η)]≤η/3. Hence by the union bound, the probability that b_(τ)∈𝒳′ and there are at least Ŝ(T,ε,η) ball-samples subsequent to τ is at least 1−2η/3. Theorem 3 can then be applied with a residual probability of failure of η/3 to obtain the claimed result.
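To give a sense of scale for the quantities in Theorem 7, the following minimal sketch evaluates τ and Ŝ(T,ε,η) exactly as they are written in the theorem statement. The numeric inputs are hypothetical, and clamping the bound at zero when T is too small is an added convenience rather than part of the theorem.

```python
import math


def burn_in_steps(eta, eps, mu):
    """tau = ln(2/eta) / (eps * mu), as written in the statement of Theorem 7."""
    return math.log(2.0 / eta) / (eps * mu)


def ball_sample_lower_bound(T, eps, eta, mu):
    """S_hat(T, eps, eta): high-probability lower bound on ball-samples after step tau."""
    tau = burn_in_steps(eta, eps, mu)
    remaining = T - tau - 1.0
    if remaining <= 0.0:
        return 0.0  # assumption: report zero when the budget T is too small
    deviation = math.sqrt((remaining / 2.0) * math.log(3.0 / eta))
    return max(0.0, (1.0 - eps) * remaining - deviation)


# Hypothetical setting: 1000 evaluations, 10% uniform sampling, eta = 0.05, mu = 0.5.
tau = burn_in_steps(eta=0.05, eps=0.1, mu=0.5)
s_hat = ball_sample_lower_bound(T=1000, eps=0.1, eta=0.05, mu=0.5)
```

With these example inputs, τ is a little under 74 steps and Ŝ is a little under 790 ball-samples, so most of the evaluation budget remains available for the ball-sampling analysis of Theorem 3.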

9.5 Example Comparison to Random Search and Gradient Descent

Gradientless Descent combines many of the desirable properties of random search and gradient descent. Like random search, it is robust to pathological objectives. It is non-oblivious, but only slightly, insofar as it only considers the rank ordering of points. Hence its behavior is invariant under monotone transformations of the objective value, and it is immune to noise that does not change the rank order of the evaluated points {x_(t)}_(t>0). Insofar as it does not model the objective in any way, it is also very robust even to many kinds of noise that do re-rank the points.

Remarkably, despite these similarities to random search, Gradientless Descent has convergence properties not unlike gradient descent, without using any gradients. Gradient descent is known to converge exponentially fast to the optimal solution for suitably strongly-convex objective functions ƒ:

Theorem 8: Let ƒ:𝒳→ℝ be a strongly convex, L-Lipschitz continuous function on 𝒳⊆ℝ^(d), such that there exist constants 0<m<M with mI≤∇²ƒ(x)≤MI for all x∈𝒳 (where I is the identity matrix), and such that the unique minimizer x* of ƒ lies in 𝒳. Then gradient descent with fixed step size 1/M starting at x₀ satisfies ƒ(x_(t))−ƒ(x*)≤c^(t)(ƒ(x₀)−ƒ(x*)) where

$c = {\left( {1 - \frac{m}{M}} \right).}$

See section 9.3 of Boyd and Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge.

Note the number of iterations to reach a desired accuracy depends linearly on M/m, which Boyd and Vandenberghe point out is a bound on the condition number of the sublevel sets {x:ƒ(x)≤y}.
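The contraction in Theorem 8 can be illustrated on a simple case. The sketch below assumes a diagonal quadratic ƒ(x)=½Σ_(i)h_(i)x_(i)², whose Hessian eigenvalues h_(i) lie in [m,M] and whose minimizer is x*=0 with ƒ(x*)=0. The eigenvalues and starting point are hypothetical.

```python
def quadratic(h, x):
    """f(x) = 0.5 * sum_i h_i x_i^2, an assumed test objective with minimizer 0."""
    return 0.5 * sum(hi * xi * xi for hi, xi in zip(h, x))


def gradient_descent_gaps(h, x0, steps):
    """Run gradient descent with fixed step 1/M; return the gap f(x_t) - f(x*) per step."""
    step = 1.0 / max(h)  # M is the largest Hessian eigenvalue
    x = list(x0)
    gaps = [quadratic(h, x)]
    for _ in range(steps):
        # The gradient of the quadratic is h_i * x_i in coordinate i.
        x = [xi - step * hi * xi for hi, xi in zip(h, x)]
        gaps.append(quadratic(h, x))
    return gaps


h = [1.0, 4.0, 10.0]   # hypothetical eigenvalues, so m = 1 and M = 10
c = 1.0 - min(h) / max(h)
gaps = gradient_descent_gaps(h, x0=[1.0, 1.0, 1.0], steps=50)
# Theorem 8 predicts gaps[t] <= c**t * gaps[0] for every t.
bound_holds = all(gaps[t] <= (c ** t) * gaps[0] + 1e-15 for t in range(len(gaps)))
```

For this example c=0.9, and the number of steps needed to reach a fixed accuracy grows in proportion to M/m, as noted above.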

These results are now compared to the convergence results for Gradientless Descent. To make the comparison fair, gradient descent is charged d+1 for an evaluation of ƒ that provides ƒ(x) and the gradient ∇ƒ(x), where there are d dimensions. Inspecting Theorem 3, it can be seen that before hitting the minimum resolution δ, the optimality gap shrinks roughly as c^(t) for

$c = {1 - {{\Theta\left( \frac{\theta\beta}{d \cdot {\log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)}} \right)}.}}$

It has been proven that the convergence exponent for Gradientless Descent depends linearly on the condition number of sublevel sets when the level sets are ellipsoids, in a direct analogy to gradient descent. More generally, the θβ terms represent this dependence in the statement of Theorem 3. Hence after d steps, Gradientless Descent will shrink the optimality gap by a factor of roughly

${c^{d} = {1 - {\Theta\left( {{\theta\beta}/{\log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)}} \right)}}},$

while gradient descent will shrink it by

$1 - \frac{m}{M}$

with a single evaluation that includes gradients, which is analogous to 1−Θ(θβ).

So, if the gradient computation costs d, Gradientless Descent converges analogously to gradient descent, aside from the

$\log_{2}\left( \frac{{diam}(\mathcal{X})}{\delta\sqrt{d}} \right)$

term in the denominator. This term exists because the appropriate radius needs to be guessed (roughly θ∥b_(t−1)−x*∥/√d). Bear in mind that in many cases, gradient descent is also similarly slowed, because the appropriate step size for gradient descent is based on M, which is not typically known a priori. If the step size had to be guessed, or a line search gradient descent had to be performed, a similar penalty could be incurred.

The similarities in the rates of convergence are striking in light of the simplicity of Gradientless Descent and the differences in assumptions about ƒ. In particular, Gradientless Descent does not require convexity at all, but only balanced level-sets without corners. For example, it will perform equally well on ƒ(x)=∥x∥₂−sin(∥x∥₂) as on ƒ(x)=∥x∥₂², exhibiting exponentially fast convergence on both, with only an inverse linear dependence on dimensionality.

Like gradient descent, Gradientless Descent may be viewed as a basic algorithmic chassis upon which more sophisticated variants can be built. For example, just as one may decay or adaptively vary learning rates for gradient descent, one might change the distribution from which the ball-sampling radii are chosen, perhaps shrinking the minimum radius δ as the algorithm progresses, or concentrating more probability mass on smaller radii. As another example, analogously to adaptive per-coordinate learning rates (see, e.g., Duchi et al. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(July):2121-2159, 2011; and McMahan and Streeter. Adaptive bound optimization for online convex optimization. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, Jun. 27-29, 2010, pp. 244-256, 2010), the shape of the balls being sampled could be adaptively changed into ellipsoids with various length-scale factors. Thus, as used herein, the term “ball” does not exclusively refer to circular or spherical shaped spaces but can also include ellipsoids or other curved, enclosed shapes.

10. EXAMPLE DEVICES AND SYSTEMS

FIG. 12 depicts a block diagram of an example computing system 100 according to example embodiments of the present disclosure. The example computing system 100 can include one or more user computing devices 132; one or more manager computing devices 102; one or more suggestion worker computing devices 124; one or more early stopping computing devices 128; one or more evaluation worker computing devices 130; and a persistent database 104.

The database 104 can store a full state of one or more Trials and/or Studies along with any other information associated with a Trial or a Study. The database can be one database or can be multiple databases operatively connected.

The manager computing device(s) 102 can include one or more processors 112 and a memory 114. The one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected. The memory 114 can include one or more non-transitory computer-readable storage mediums, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof. The memory 114 can store data 116 and instructions 118 which are executed by the processor(s) 112 to cause the manager computing device(s) 102 to perform operations.

Similar to the manager computing device(s) 102, each of: the one or more user computing devices 132; the one or more suggestion worker computing devices 124; the one or more early stopping computing devices 128; and the one or more evaluation worker computing devices 130 can include one or more processors (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and a memory (e.g., RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc.) as described above with respect to reference numerals 112 and 114. Thus, each device can include processor(s) and a memory as described above.

The manager computing device(s) 102 can include an API handler 120 and a dangling work finder 122. The API handler 120 can implement and/or handle requests that come from the user computing device(s) 132 via an API. The API can be a REST API and/or can use an internal RPC protocol. The API handler 120 can receive requests from the user computing device(s) 132 that use the API (e.g., a request to check the status of an operation) and can communicate with the one or more suggestion worker computing devices 124; one or more early stopping computing devices 128; one or more evaluation worker computing devices 130; and/or a persistent database 104 to provide operations and/or information in response to the user request via the API.

The dangling work finder 122 can restart work lost to preemptions. For example, when a request is received by a suggestion worker computing device 124 to generate suggestions, the suggestion worker computing device 124 can first place a distributed lock on the corresponding Study, which can ensure that work on the Study is not duplicated by multiple devices or instances. If the suggestion worker computing device 124 instance fails (e.g., due to hardware failure, job preemption, etc.), the lock can expire, making it eligible to be picked up by the dangling work finder 122 which can then reassign the Study to a different suggestion worker computing device 124.

In some implementations, if a Study is picked up by the dangling work finder 122 too many times, the dangling work finder 122 can detect this, temporarily halt the Study, and alert an operator to the crashes. This can help prevent subtle bugs that only affect a few Studies from causing crash loops that can affect the overall stability of the system.
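The lock-expiry and reassignment behavior described above can be sketched in a single process as follows. This is an illustrative model only: the class names, the lease length, the pickup limit, and the use of an explicit `now` timestamp in place of a real distributed lock service are all assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StudyLock:
    worker: str        # worker that currently holds the Study
    expires_at: float  # time after which the lock counts as expired


@dataclass
class DanglingWorkFinder:
    max_pickups: int = 3          # hypothetical limit before a Study is halted
    lease_seconds: float = 30.0   # hypothetical lock lifetime
    locks: Dict[str, StudyLock] = field(default_factory=dict)
    pickups: Dict[str, int] = field(default_factory=dict)
    halted: List[str] = field(default_factory=list)

    def acquire(self, study: str, worker: str, now: float) -> bool:
        """A worker locks a Study so that no other worker duplicates the work."""
        lock = self.locks.get(study)
        if study in self.halted or (lock is not None and lock.expires_at > now):
            return False
        self.locks[study] = StudyLock(worker, now + self.lease_seconds)
        return True

    def sweep(self, now: float, workers: List[str]) -> Dict[str, str]:
        """Reassign Studies whose locks expired; halt Studies picked up too often."""
        reassigned: Dict[str, str] = {}
        for study, lock in list(self.locks.items()):
            if lock.expires_at > now:
                continue  # the current worker still holds a live lock
            self.pickups[study] = self.pickups.get(study, 0) + 1
            if self.pickups[study] > self.max_pickups:
                self.halted.append(study)  # an operator alert would be raised here
                del self.locks[study]
                continue
            # Prefer a worker other than the one whose lock just expired.
            new_worker = next((w for w in workers if w != lock.worker), lock.worker)
            self.locks[study] = StudyLock(new_worker, now + self.lease_seconds)
            reassigned[study] = new_worker
        return reassigned
```

In this sketch a Study that keeps crashing its workers is reassigned at most `max_pickups` times and is then moved to the `halted` list, which mirrors the crash-loop protection described above.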

Each of the API handler 120 and the dangling work finder 122 can include computer logic utilized to provide desired functionality. Each of the API handler 120 and the dangling work finder 122 can be implemented in hardware, firmware, and/or software controlling a general purpose processor. For example, in some implementations, each of the API handler 120 and the dangling work finder 122 includes program files stored on a storage device, loaded into a memory and executed by one or more processors. In other implementations, each of the API handler 120 and the dangling work finder 122 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.

In some implementations, the user computing device(s) can include personal computing devices, laptops, desktops, user server devices, smartphones, tablets, etc. In some implementations, the user computing device(s) 132 can interact with the API handler 120 via an interactive user interface. In some implementations, the user computing device(s) 132 can perform suggestion evaluation in addition to or alternatively to the evaluation worker computing device(s) 130. In some implementations, a user can evaluate a suggested set of parameters offline and then enter the result of the evaluation into the user computing device 132 which is then communicated to the manager computing device 102 and stored in the persistent database 104.

The suggestion worker computing device(s) 124 can provide one or more suggested sets of parameters. For example, the suggestion worker computing device(s) 124 can implement one or more black-box optimizers 126 to generate the new suggestions. The one or more black-box optimizers 126 can implement any of the example black-box optimization techniques described above.

The early stopping computing device(s) 128 can perform one or more early stopping techniques to determine whether to stop an evaluation of a Trial that is in progress. For example, example early stopping techniques are described above in sections 3.1 and 3.2.

The evaluation worker computing device(s) 130 can evaluate a suggested set of parameters and, in response, provide a result. For example, the result can be an evaluation of an objective function for a suggested set of parameters. In some implementations, the evaluation worker computing device(s) 130 can be provided and/or owned by the user. In other implementations, the evaluation worker computing device(s) 130 are provided as a managed service. In some implementations in which suggested Trials can be evaluated offline (e.g., through manual or physical evaluation), the evaluation worker computing device(s) 130 are not used or included.

11. EXAMPLE METHODS

FIG. 13 depicts a flow chart diagram of an example method 1300 to perform black-box optimization according to example embodiments of the present disclosure.

At 1302, a computing system obtains a best observed set of values. For example, the best observed set of values can be retrieved from a memory. The best observed set of values can include a value for each of one or more adjustable parameters.

In some implementations, at a first instance of 1302, the best observed set of values can simply be set equal to a first suggested set of values. For example, the first suggested set of values can simply be a random selection from a feasible parameter space for the one or more adjustable parameters.

At 1304, the computing system determines whether to perform a random sampling technique or a ball sampling technique. In some implementations, the determination made at 1304 can be probabilistic. For example, in some implementations, determining whether to perform the random sampling technique or the ball sampling technique at 1304 can include determining whether to perform the random sampling technique or the ball sampling technique according to a predefined probability. For example, the predefined probability can be a user-defined probability.

When the probability is greater than zero, method 1300 can, in at least some iterations, investigate uniformly randomly sampled points. This can be used to handle multiple minima, and can guarantee that the worst-case performance cannot be much worse than Random Search.

In some implementations, the predefined probability can change (e.g., adaptively change) over a number of iterations of the method 1300. For example, the predefined probability can increasingly lead to selection of the ball sampling technique at 1304 as the number of iterations increases.
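One way the probability used at 1304 could shift toward ball sampling over time is a simple decay schedule. The sketch below is only one hypothetical choice: the exponential form, the half-life, and the floor value are assumptions made for illustration, and the disclosure does not prescribe a particular schedule.

```python
import random


def random_sampling_probability(iteration, initial=0.5, floor=0.05, half_life=100.0):
    """Probability of choosing uniform random sampling at a given iteration.

    The probability halves every `half_life` iterations and never drops below
    `floor`, so uniform random samples keep being investigated occasionally.
    """
    return max(floor, initial * 0.5 ** (iteration / half_life))


def choose_technique(iteration, rng):
    """Probabilistic decision at 1304: returns 'random' or 'ball'."""
    if rng.random() < random_sampling_probability(iteration):
        return "random"
    return "ball"


rng = random.Random(0)
early_choices = [choose_technique(0, rng) for _ in range(1000)]
late_choices = [choose_technique(5000, rng) for _ in range(1000)]
```

Keeping the floor above zero preserves the property noted above that some uniformly sampled points are still investigated, which is what allows multiple minima to be handled.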

If it is determined at 1304 that the random sampling technique should be performed, then method 1300 can proceed to 1306. At 1306, the computing system performs the random sampling technique to obtain a new suggested set of values. For example, the random sampling technique can include selecting a random sample from the feasible parameter space for the one or more adjustable parameters.

However, referring again to 1304, if it is determined at 1304 that the ball sampling technique should be performed, then method 1300 can proceed to 1308. At 1308, the computing system performs the ball sampling technique to obtain a new suggested set of values.

As one example ball sampling technique that can be performed, FIG. 14 depicts a flow chart diagram of an example method 1400 to perform a ball sampling technique according to example embodiments of the present disclosure.

At 1402, a computing system determines a radius for a ball. In some implementations, at 1402, the radius can be selected from a geometric series of possible radii. For example, the radius can be selected at random from the geometric series of radii. In some implementations, an upper limit on the geometric series of radii can be dependent on the diameter of the dataset, a resolution of the dataset, and/or a dimensionality of an objective function.

In some implementations, determining the radius for the ball at 1402 can include determining the radius based at least in part on a user-defined resolution term. As one example, determining the radius for the ball at 1402 can include randomly sampling the radius from a distribution of available radii that has a minimum equal to the user-defined resolution term. In some implementations, determining the radius for the ball at 1402 can include randomly sampling the radius from a distribution of available radii that has a maximum that is based at least in part on a diameter of the feasible set of values for the one or more adjustable parameters.

At 1404, the computing system generates the ball that has the radius around the best observed set of values. At 1406, the computing system determines a random sample from within the ball.

At 1408, the computing system projects the random sample from within the ball onto the feasible set of values for one or more adjustable parameters. At 1410, the computing system selects the projection of the random sample onto the feasible set of values as the suggested set of values.
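The steps of method 1400 can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the disclosure: the feasible set is an axis-aligned box (so that projection reduces to clipping each coordinate), the geometric series of radii has ratio 2 starting at the resolution term `delta`, and the sample is drawn uniformly from the ball. All function names are hypothetical.

```python
import math
import random


def geometric_radii(delta, diameter):
    """Radii delta, 2*delta, 4*delta, ... not exceeding the feasible-set diameter."""
    radii = [delta]
    while radii[-1] * 2.0 <= diameter:
        radii.append(radii[-1] * 2.0)
    return radii


def sample_in_ball(center, radius, rng):
    """Draw a point uniformly at random from the ball of given radius around center."""
    d = len(center)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g)) or 1.0
    # A uniform direction scaled by radius * U^(1/d) is uniform over the ball.
    scale = radius * rng.random() ** (1.0 / d) / norm
    return [c + scale * x for c, x in zip(center, g)]


def project_to_box(point, lower, upper):
    """Euclidean projection onto an axis-aligned box (coordinate-wise clipping)."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(point, lower, upper)]


def ball_sample(best, lower, upper, delta, rng):
    """One pass through 1402-1410 for a box-shaped feasible set."""
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lower, upper)))
    radius = rng.choice(geometric_radii(delta, diameter))  # 1402
    candidate = sample_in_ball(best, radius, rng)          # 1404 and 1406
    return project_to_box(candidate, lower, upper)         # 1408 and 1410
```

For feasible sets that are not boxes, `project_to_box` would need to be replaced by a projection suited to that set; the rest of the sketch would be unchanged.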

Referring again to FIG. 13, having obtained a new suggested set of values at 1306 or 1308, next at 1310, the computing system provides the suggested set of values for evaluation.

At 1312, the computing system receives a new result obtained through evaluation of the suggested set of values. At 1314, the computing system compares the new result to a best observed result obtained through evaluation of the best observed set of values and sets the best observed set of values equal to the suggested set of values if the new result outperforms the best observed result.

At 1316, the computing system determines whether to perform additional iterations. The determination at 1316 can be made according to a number of different factors. In one example, iterations are performed until an iteration counter reaches a predetermined threshold. In another example, iteration-over-iteration improvement (e.g., |previous best result−new result|) can be compared to a threshold value. The iterations can be stopped when the iteration-over-iteration improvement is below the threshold value. In yet another example, the iterations can be stopped when a certain number of sequential iteration-over-iteration improvements are each below the threshold value. Other stopping techniques can be used as well.
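The example stopping rules at 1316 can be combined in a single check, sketched below. The function name, the representation of progress as a list of best observed results, and the `patience` parameter (the number of sequential small improvements required) are assumptions made for illustration.

```python
def should_stop(best_history, max_iterations, threshold, patience):
    """Decide at 1316 whether to stop iterating.

    best_history holds the best observed result after each iteration, oldest first.
    Stops when the iteration budget is used up, or when the last `patience`
    iteration-over-iteration improvements are each below `threshold`.
    """
    if len(best_history) >= max_iterations:
        return True
    if len(best_history) <= patience:
        return False  # not enough history to judge improvement yet
    recent = best_history[-(patience + 1):]
    improvements = [abs(prev - new) for prev, new in zip(recent, recent[1:])]
    return all(step < threshold for step in improvements)


stalled = should_stop([5.0, 4.0, 3.9999, 3.9999], max_iterations=100, threshold=1e-3, patience=2)
improving = should_stop([5.0, 4.0, 3.0], max_iterations=100, threshold=1e-3, patience=2)
```

Setting `patience` to 1 gives the single-improvement rule described above, while larger values give the rule based on a number of sequential small improvements.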

If it is determined at 1316 that additional iterations should be performed, then method 1300 returns to 1304. In such fashion, new suggested sets of values can be iteratively produced and evaluated.

However, if it is determined at 1316 that additional iterations should not be performed, then method 1300 proceeds to 1318. At 1318, the computing system provides the best observed set of values as a result.
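Putting the steps of method 1300 together, the overall loop can be sketched as below. This is a simplified illustration rather than a definitive implementation: it assumes minimization, a box-shaped feasible set, a fixed probability `eps` of uniform random sampling, radii of the form `delta * 2**k`, a fixed iteration budget as the stopping rule, and an objective that can be called directly in-process. All names and default values are hypothetical.

```python
import math
import random


def gradientless_descent(objective, lower, upper, delta, eps, iterations, seed=0):
    """Sketch of method 1300 for minimizing `objective` over a box [lower, upper]."""
    rng = random.Random(seed)
    d = len(lower)
    diameter = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(lower, upper)))
    num_radii = max(1, int(math.log2(diameter / delta)) + 1)

    def uniform_sample():
        return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]

    def ball_sample(center):
        radius = delta * 2.0 ** rng.randrange(num_radii)  # random radius from the series
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g)) or 1.0
        scale = radius * rng.random() ** (1.0 / d) / norm
        # Clip to the box, which is the projection onto a box-shaped feasible set.
        return [min(max(c + scale * x, lo), hi)
                for c, x, lo, hi in zip(center, g, lower, upper)]

    best = uniform_sample()              # 1302: first suggestion is a random point
    best_result = objective(best)
    for _ in range(iterations):          # 1316: fixed iteration budget
        if rng.random() < eps:           # 1304: probabilistic choice of technique
            candidate = uniform_sample()       # 1306
        else:
            candidate = ball_sample(best)      # 1308
        result = objective(candidate)    # 1310 and 1312: evaluate the suggestion
        if result < best_result:         # 1314: only the rank order is used
            best, best_result = candidate, result
    return best, best_result             # 1318


def sphere(x):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)


best_point, best_value = gradientless_descent(
    sphere, lower=[-5.0] * 3, upper=[5.0] * 3, delta=1e-3, eps=0.1, iterations=2000)
```

Because the loop only compares `result` with `best_result`, replacing the objective by any strictly increasing transformation of it leaves the accept and reject decisions unchanged in exact arithmetic, which reflects the rank-order property discussed in section 9.5.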

12. ADDITIONAL DISCLOSURE

The technology discussed herein makes reference to servers, databases, software applications, and other computer-based systems, as well as actions taken and information sent to and from such systems. The inherent flexibility of computer-based systems allows for a great variety of possible configurations, combinations, and divisions of tasks and functionality between and among components. For instance, processes discussed herein can be implemented using a single device or component or multiple devices or components working in combination. Databases and applications can be implemented on a single system or distributed across multiple systems. Distributed components can operate sequentially or in parallel.

While the present subject matter has been described in detail with respect to various specific example embodiments thereof, each example is provided by way of explanation, not limitation of the disclosure. Those skilled in the art, upon attaining an understanding of the foregoing, can readily produce alterations to, variations of, and equivalents to such embodiments. Accordingly, the subject disclosure does not preclude inclusion of such modifications, variations and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art. For instance, features illustrated or described as part of one embodiment can be used with another embodiment to yield a still further embodiment. Thus, it is intended that the present disclosure cover such alterations, variations, and equivalents.

In particular, although FIGS. 13 and 14 respectively depict steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement. The various steps of the methods 1300 and 1400 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.

1.-22. (canceled)
 23. A computer-implemented method for use in optimization of parameter values for machine-learning models, the method comprising: receiving, by one or more computing devices, one or more prior evaluations of performance of a machine learning model, the one or more prior evaluations being respectively associated with one or more prior variants of the machine-learning model, the one or more prior variants of the machine-learning model each having been configured using a different set of adjustable parameter values; utilizing, by the one or more computing devices, an optimization algorithm to generate a suggested variant of the machine-learning model based at least in part on the one or more prior evaluations of performance and the associated set of adjustable parameter values, the suggested variant of the machine-learning model being defined by a suggested set of adjustable parameter values; and performing, by the one or more computing devices, transfer learning to obtain initial values for one or more adjustable parameters of the machine-learning model based on the one or more prior variants of the machine-learning model.
 24. The computer-implemented method of claim 23, wherein the one or more prior variants of the machine-learning model comprise a plurality of previously optimized machine learned models.
 25. The computer-implemented method of claim 24, wherein performing, by the one or more computing devices, transfer learning comprises: identifying, by the one or more computing devices, the plurality of previously optimized machine learned models, wherein the plurality of previously optimized machine learned models are organized in a sequence; and building, by the one or more computing devices, a plurality of Gaussian Process regressors respectively for the plurality of previously optimized machine learned models.
 26. The computer-implemented method of claim 25, wherein the Gaussian Process regressor for each previously optimized machine learned model is trained on one or more residuals relative to the Gaussian Process regressor for the previous previously optimized machine learned model in the sequence.
 27. The computer-implemented method of claim 25, wherein the sequence is in temporal order based on when the plurality of previously optimized machine learned models were performed.
 28. A computer system operable to suggest parameter values for machine-learned models, the computer system comprising: a database that stores one or more results respectively associated with one or more sets of parameter values for one or more adjustable parameters of a machine-learned model, the result for each set of parameter values comprising an evaluation of the machine-learned model constructed with such set of parameter values for the one or more adjustable parameters; one or more processors; and one or more non-transitory computer-readable media that store instructions that, when executed by the one or more processors, cause the computer system to perform operations, the operations comprising: performing one or more black box optimization techniques to generate a suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the one or more results and the one or more sets of parameter values respectively associated with the one or more results; and performing transfer learning to obtain initial parameter values for the one or more adjustable parameters.
 29. The computer system of claim 28, wherein the operations further comprise: accepting an adjustment to the suggested set of parameter values from a user, the adjustment comprising at least one change to the suggested set of parameter values to form an adjusted set of parameter values; receiving a new result obtained through evaluation of the machine-learned model constructed with the adjusted set of parameter values; and associating the new result and the adjusted set of parameter values with the one or more results and the one or more sets of parameter values in the database.
 30. The computer system of claim 29, wherein the operations further comprise: generating a second suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the new result for the adjusted set of parameter values.
 31. The computer system of claim 28, wherein performing transfer learning comprises: identifying a plurality of previously studied machine-learned models, the plurality of previously studied machine-learned models organized in a sequence; and building a plurality of Gaussian Process regressors respectively for the plurality of previously studied machine-learned models, wherein the Gaussian Process regressor for each previously studied machine-learned model is trained on one or more residuals relative to the Gaussian Process regressor for a previous previously studied machine-learned model in the sequence.
 32. The computer system of claim 31, wherein the sequence is in temporal order based on when the plurality of previously studied machine-learned models were performed.
 33. The computer system of claim 28, wherein the one or more adjustable parameters of the machine-learned model comprises one or more adjustable hyperparameters of the machine-learned model.
 34. The computer system of claim 28, wherein the operations further comprise performing a plurality of rounds of generation of suggested sets of parameter values using at least two different black box optimization techniques.
 35. The computer system of claim 34, wherein the operations further comprise automatically changing black box optimization techniques between at least two of the plurality of rounds of generation of suggested sets of parameter values.
 36. The computer system of claim 34, wherein the at least two different black box optimization techniques are stateless so as to enable switching between black box optimization techniques between at least two of the plurality of rounds of generation of suggested sets of parameter values.
 37. The computer system of claim 28, wherein the operations further comprise: performing a plurality of rounds of generation of suggested sets of parameter values; and receiving a change to a feasible set of values for at least one of the one or more adjustable parameters of the machine-learned model between at least two of the plurality of rounds of generation of suggested sets of parameter values.
 38. The computer system of claim 28, wherein the operations further comprise providing for display a parallel coordinates visualization of the one or more results and the one or more sets of parameter values for the one or more adjustable parameters.
 39. A computer-implemented method to suggest parameter values for machine-learned models, the method comprising: receiving, by the one or more computing devices, one or more results respectively associated with one or more sets of parameter values for one or more adjustable parameters of a machine-learned model, the result for each set of parameter values comprising an evaluation of the machine-learned model constructed with such set of parameter values for the one or more adjustable parameters; generating, by the one or more computing devices, a suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the one or more results and the one or more sets of parameter values respectively associated with the one or more results; and performing transfer learning to obtain initial parameter values for the one or more adjustable parameters.
 40. The computer-implemented method of claim 39, further comprising: receiving, by the one or more computing devices, an adjustment to the suggested set of parameter values from a user, the adjustment comprising at least one change to the suggested set of parameter values to form an adjusted set of parameter values; receiving, by the one or more computing devices, a new result associated with the adjusted set of parameter values; and associating, by the one or more computing devices, the new result and the adjusted set of parameter values with the one or more results and the one or more sets of parameter values.
 41. The computer-implemented method of claim 40, further comprising: generating, by the one or more computing devices, a second suggested set of parameter values for the one or more adjustable parameters of the machine-learned model based at least in part on the new result for the adjusted set of parameter values.
 42. The computer-implemented method of claim 39, wherein the one or more adjustable parameters of the machine-learned model comprises one or more adjustable hyperparameters of the machine-learned model.